This is an archive of the discontinued LLVM Phabricator instance.

[Lex] Allow to consume tokens while preprocessing
ClosedPublic

Authored by ilya-biryukov on Mar 27 2019, 9:20 AM.

Details

Summary

By adding a hook to consume all tokens produced by the preprocessor.
The intention of this change is to make it possible to consume the
expanded tokens without re-runnig the preprocessor with minimal changes
to the preprocessor and minimal performance penalty when preprocessing
without recording the tokens.

The added hook is very low-level and reconstructing the expanded token
stream requires more work in the client code, the actual algorithm to
collect the tokens using this hook can be found in the follow-up change.

Diff Detail

Repository
rC Clang

Event Timeline

ilya-biryukov created this revision.Mar 27 2019, 9:20 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 27 2019, 9:20 AM

I'm happy to assess the performance costs of this change if needed (the function is obviously on the hot path), but I hope this won't add any significant performance costs.
Also happy to consider alternative approaches to collect the tokens when preprocessing, the current implementation is complicated (search for 'TokenCollector' in D59887) and my intuition is that moving parts of it to the preprocessor should simplify things considerably.

Pass an enum indicating where the token comes from.
The enum is ad-hoc at the moment, will require some thought to turn
it into a reasonable abstraction.

The consumer of the token stream actually needs to be able to distinguish
tokens which are part of the final expanded token stream vs those that
aren't (macro directives, intermediate macro argument expansions, etc)

ilya-biryukov added inline comments.Apr 16 2019, 7:01 AM
clang/include/clang/Lex/Preprocessor.h
120 ↗(On Diff #193901)

This is supposed to provide enough information to distinguish token from an intermediate macro expansion from a token of a final expanded stream.

It's not very well thought-through, though, the final set of enum values will definitely end up being different.

  • Revamp TokenSource, make it more principled

I'd like to understand more about the intended use cases of this functionality. What information do clients want?

clang/lib/Lex/Preprocessor.cpp
870–900 ↗(On Diff #197278)

This is a lot of extra stuff to be doing in the main Lex loop. Adding one (perfectly-predicted) branch on OnToken seems like it should be fine, but this seems like a bit more added complexity than I'd prefer. I'd like some performance measurements of -cc1 -Eonly to see if this makes any real difference.

896 ↗(On Diff #197278)

This isn't a directive; these are normal tokens.

956–957 ↗(On Diff #197278)

This seems like it's going to be extremely hard to use. If you just want the expanded token stream, then removing all the other calls to OnToken and only calling it here when LexLevel == 0 will give you that.

ilya-biryukov marked an inline comment as done.May 2 2019, 4:24 AM

I'd like to understand more about the intended use cases of this functionality. What information do clients want?

We are aiming to collect all expanded tokens, all raw tokens and how they relate to each other (i.e. which raw tokens form macro calls, #include directives, etc that produced particular subrange of expanded tokens).
We go through all this trouble to allow mapping expanded tokens (which is what you "see" in the AST, even though AST does not store them) back to the raw tokens that produced them (which you can finally map to textual offsets, the final output in various tooling-related scenarios).
That information is used specifically to map (an allowed subset of) changes in the expanded token stream to textual changes.

We could actually get most of this information merely by looking at source locations. The only bit that we're missing is how to map from a macro call tokens into all expanded tokens that were produced by it.
I'll play around with reporting only the expanded token stream here and using source locations to produce the rest of the information we need.

clang/lib/Lex/Preprocessor.cpp
870–900 ↗(On Diff #197278)

Happy to do the measurements. Do you have a good collection of input files for this in mind?

896 ↗(On Diff #197278)

Sorry if it's a silly question, just wanted to clarify I'm not missing anything here.

These tokens are the name of the module, "import-suffix" and the semicolon that follows it?
And they end up being used to build the AST for ImportDecl?

ilya-biryukov marked an inline comment as done.May 2 2019, 7:04 AM
ilya-biryukov added inline comments.
clang/lib/Lex/Preprocessor.cpp
956–957 ↗(On Diff #197278)

I've tried LexLevel == 0 && IsNewToken and it almost works.
The only trouble left is dealing with the delayed parsing. E.g. I'm getting the callbacks for the same tokens multiple times in cases like parsing of method bodies.

Any ideas on how to best tackle this?

The example is:

struct X {
  int method() { return 10; }
  int a = 10;
};

Seeing multiple callbacks for { return 10; } and = 10;, want to see only the first one.

  • Simplify the interface, get only the expanded token stream
  • An attempt to not report tokens from delayed parsing twice, almost works
ilya-biryukov marked an inline comment as not done.May 2 2019, 8:16 AM
ilya-biryukov added inline comments.
clang/lib/Lex/Preprocessor.cpp
956–957 ↗(On Diff #197278)

It almost works now, see the update revision.
Almost no tokens get reported by the callback twice now, except the following case.
Input:

struct X {
  int method() { return method(); }
};

After the body of the class, we get a an extra callback for ) token from one of the functions under ParsedLexedMethodDef.
This seems to be happening only for parenthesis.

ilya-biryukov marked an inline comment as not done.May 2 2019, 10:13 AM
ilya-biryukov added inline comments.
clang/lib/Lex/Preprocessor.cpp
956–957 ↗(On Diff #197278)

More precisely, this happens from a following sequence of calls:

Lex() {LexLevel=0}
CachingLex() {LexLevel=1} // which does not have a token in `CachedTokens`, so it recurses into ...
Lex() {LexLevel=1}        // which results in a call to ...
CurTokenLexer->Lex()

At this point CurTokenLexer returns a token of a pre-lexed method definition, but the current implementation forgets that this was the case by the time the token is processed at the the top of the callstack (Lex() {LexLevel=0}).
So the token ends up being reported by a callback.

  • Propagate whether the token came from a token stream through CachingLex.

I've managed to get what I needed by returning a flag from Lex.
@rsmith, could you take a look? Is there a better way to do the equivalent?

  • Get rid of an accidental change in how LexAfterModuleImport is handled
rsmith added inline comments.May 8 2019, 11:41 AM
clang/lib/Lex/Preprocessor.cpp
870–900 ↗(On Diff #197278)

I think any large header should be fine for testing; I don't expect this to scale worse in the presence of lots of macro expansion or anything like that.

896 ↗(On Diff #197278)

Right. LexAfterModuleImport is a pass-through lexing layer that takes special action on the ObjC @import module.name; and C++20 import "header"; declarations, which are just normal phase 4 tokens.

956–957 ↗(On Diff #197278)

Yeah, the parser messes with the token stream to support C++'s out-of-order parsing rule (among other things). Ignoring that here seems right. You might also want to make sure that calls to EnterToken are handled properly. The extra paren likely comes from such a call.

ilya-biryukov added a comment.EditedMay 9 2019, 8:47 AM

Here are some numbers from running clang -cc1 -Eonly on SemaExpr.cpp, it includes a whole ton of headers, including a rather large TreeTransform.h. This was run on my machine, so keep in mind it's rather noisy. Minimal times should be somewhat representative, though.

Before:
  Time (mean ± σ):     159.5 ms ±   7.3 ms    [User: 137.3 ms, System: 22.1 ms]
  Range (min … max):   148.9 ms … 199.5 ms    100 runs

After (no callback specified):
  Time (mean ± σ):     165.0 ms ±   7.5 ms    [User: 142.8 ms, System: 22.2 ms]
  Range (min … max):   155.0 ms … 196.7 ms    100 runs

And two more measurements with callbacks non-empty callbacks:

Counting tokens:
  Time (mean ± σ):     169.0 ms ±   9.5 ms    [User: 147.3 ms, System: 21.7 ms]
  Range (min … max):   158.3 ms … 206.0 ms    100 runs

Saving tokens to a vector:
  Time (mean ± σ):     191.0 ms ±   5.5 ms    [User: 154.1 ms, System: 36.8 ms]
  Range (min … max):   180.4 ms … 221.0 ms    100 runs

Overall, the performance cost in empty-callback case seems to be lower than 5%. This is significant, but hopefully acceptable.

Overall, the performance cost in empty-callback case seems to be lower than 5%. This is significant, but hopefully acceptable.

5% is a lot to pay for something that we won't be using at all during normal preprocessing / compilation. But I think we can simplify this a bit, and that might help.

clang/lib/Lex/PPCaching.cpp
64 ↗(On Diff #198663)

This change seems redundant: LexLevel is always 1 here, so this token would never be reported anyway. And with that gone I think you can remove the Report parameter entirely.

clang/lib/Lex/Preprocessor.cpp
892 ↗(On Diff #198663)

Doesn't this always set Report to the same value as IsNewToken? (The only case we set Report to false is when we call CachingLex and it sets IsNewToken to false, and CachingLex can't be recursively called twice, so its recursive call to Lex can't set Report to false..)

ilya-biryukov marked 2 inline comments as done.May 13 2019, 3:32 AM

I'll try to explore bringing the overhead down.
The fact that CachingLex is happening at LexLevel==1 might help, thanks for pointing that out!

clang/lib/Lex/PPCaching.cpp
64 ↗(On Diff #198663)

It is reported by the caller of CachingLex (the one that has LexLevel == 0) and it needs to know where the reported token comes from.

clang/lib/Lex/Preprocessor.cpp
892 ↗(On Diff #198663)

We can have IsNewToken = true && Report = false when the recursive call to Lex() inside CachingLex() sets Report = CurTokenLexer->isMacroExpansion().
This happens when we call CachingLex and the resulting token comes from a token stream for out-of-order parsing.

Thinking about this some more: distinguishing between "macro expansion" and other cases seems like a proxy for "this token came from inside the preprocessor / lexer" versus "this token was provided by the user", which is also exactly what IsNewToken is supposed to capture. I don't think we need two separate flags here.

I also suspect that a significant amount of the cost is the additional function call on the hot path (the 1-parameter Lex wrapper function). I wonder if there's something that we could do to remove that from the hot path without duplicating the whole Lex function. Maybe we could add a flag to Tokens to say that they're a replay of prior tokens, rather than being newly-created tokens, and make TokenLexer set that flag on tokens it produces if its token sequence came from a "replay" context (that's largely the same as your current "not a macro expansion" check, but there are probably cases where the preprocessor itself generates tokens for which that's wrong; we should have a flag on EnterTokenStream to indicate which mode we're in). We can then also change CachingLex to set the flag, and get rid of the IsNewToken flag too.

The suggested approach looks promising. The difference seems to be within the noise levels on my machine:

Before the change (baseline upstream revision):
  Time (mean ± σ):     159.1 ms ±   8.7 ms    [User: 138.3 ms, System: 20.6 ms]
  Range (min … max):   150.4 ms … 196.2 ms    100 runs

After (no callback specified):
  Time (mean ± σ):     160.4 ms ±   7.6 ms    [User: 138.5 ms, System: 21.7 ms]
  Range (min … max):   151.0 ms … 191.5 ms    100 runs

I'm sending out a prototype I used for measures for review. The flag is currently set by the preprocessor, but I guess it would make more sense to do the following before landing this:

  • flip the flag, i.e. instead of reporting whether the token is "new" (the common case), report whether it's "cached", i.e. coming from CachedTokens or from a non-macro-expansion token stream.
  • set this flag only inside CachingLex and TokenLexer.

The revision also removes IsNewToken from CachingLex and sets the flag in the token instead.
Any other suggestions?

  • Use a flag inside clang::Token instead
  • Remove the now redundant 'public:' spec.

The suggested approach looks promising. The difference seems to be within the noise levels on my machine:

:) Awesome!

I guess it would make more sense to do the following before landing this:

  • flip the flag, i.e. instead of reporting whether the token is "new" (the common case), report whether it's "cached", i.e. coming from CachedTokens or from a non-macro-expansion token stream.
  • set this flag only inside CachingLex and TokenLexer.

The revision also removes IsNewToken from CachingLex and sets the flag in the token instead.
Any other suggestions?

Notionally, I think we should separate out "produces new tokens" from "is macro expansion" in TokenLexer. Please take a quick look at the various callers of EnterTokenStream (particularly the ones inside Lex; I'm not too worried about the ones in Parse) and see which (if any) of them actually intend to inject new "real" phase-4-of-translation tokens. If there are any (and maybe even if there aren't?), please add a flag on EnterTokenStream to specify whether the TokenLexer believes it's creating new tokens or replaying tokens that have been seen before.

We'll need to decide what to do about the annotation tokens that we create when we enter / leave / import modules (Preprocessor::EnterAnnotationToken). In some sense, those are "new" tokens -- and that's how the parser will interpret them -- but they didn't come from the original source file. I think either answer could be OK there, but I'm inclined to treat them as "new" because that gives more information to the token consumer.

Will do. Also think reporting annotation tokens is ok, one can easily tell them apart and filter them by looking at the corresponding flags in clang::Token, will put it into the docs, though

I've added a new parameter to EnterToken and EnterTokenStream. There are a bunch of interesting cases.

Preprocessor::AnnotateCachedTokens is one. It consumes some CachedTokens (which are always considered re-injected) and replaces them with a single annotation token (which we normally don't consider re-injected).
So we end up with an annotation token that is re-injected (and therefore never reported). That's ok for our current use-cases, so I won't try to fix it in this change. If this actually turns out important at any moment, can fix in a follow-up.

Another interesting case is parser "tweaking" tokens and re-introducing them, e.g. FixDigraph that turns a digraph token into two or various recoveries turning :: into : or introducing ephemeral ; and } tokens.
I've decided to mark those as re-injected for the time being, that is fine for now and I'm happy to reconsider later if it turns out we need them.

  • Properly mark tokens as reinjected where necessary.
ilya-biryukov marked an inline comment as done.
  • Improve comment, remove redundant braces
clang/lib/Lex/Preprocessor.cpp
955 ↗(On Diff #199637)

Could probably remove the IsReinjected check from here. It's fine to check this on the client side.

rsmith added inline comments.May 15 2019, 11:30 AM
clang/lib/Lex/PPDirectives.cpp
1521 ↗(On Diff #199639)

I think it'd be more useful to treat this as a new token. But that's not a strong preference.

clang/lib/Lex/Pragma.cpp
370 ↗(On Diff #199639)

I think this case is a reinjection; we've copied some tokens inside __pragma out into a duplicate position in the token stream. But I guess it doesn't matter because the tokens never escape the outer Lex function anyway.

clang/lib/Lex/Preprocessor.cpp
1136 ↗(On Diff #199639)

I think this should always be false: the tokens we're producing here have never been emitted at LexLevel 0 before.

ilya-biryukov marked 3 inline comments as done.
  • Address comments
ilya-biryukov added inline comments.May 16 2019, 2:51 AM
clang/lib/Lex/PPDirectives.cpp
1521 ↗(On Diff #199639)

Ah, there were too many changes and I missed this one. Definitely agree, clients can filter out annotation tokens themselves, if needed.

clang/lib/Lex/Pragma.cpp
370 ↗(On Diff #199639)

Yeah, my logic is that it's not a re-injection in the sense that they were never the phase 4 tokens before.

clang/lib/Lex/Preprocessor.cpp
1136 ↗(On Diff #199639)

Ah, totally, we lexed the original tokens in this function...
Fixed, thanks!

rsmith accepted this revision.May 16 2019, 11:41 AM
rsmith marked an inline comment as done.

Thanks, LGTM!

clang/lib/Lex/Pragma.cpp
370 ↗(On Diff #199639)

Yeah, I think you're right. As a weird example, if we had a

#pragma clang produce_token x

pragma that just produced the given token, it'd be reasonable for it to leave the Reinjected flag alone, and it should come out as false in this case.

This revision is now accepted and ready to land.May 16 2019, 11:41 AM
  • Fix compilation of a clang-tidy check, add a FIXME to stop reporting those tokens
This revision was automatically updated to reflect the committed changes.