This is an archive of the discontinued LLVM Phabricator instance.

[Syntax] Introduce TokenBuffer, start clangToolingSyntax library
ClosedPublic

Authored by ilya-biryukov on Mar 27 2019, 9:21 AM.

Details

Summary

TokenBuffer stores the list of tokens for a file obtained after
preprocessing. This is a base building block for syntax trees,
see [1] for the full proposal on syntax trees.

This commits also starts a new sub-library of ClangTooling, which
would be the home for the syntax trees and syntax-tree-based refactoring
utilities.

[1]: https://lists.llvm.org/pipermail/cfe-dev/2019-February/061414.html

Event Timeline

There are a very large number of changes, so older changes are hidden.

Still have a few comments to address in TokenCollector and with respect to naming. But apart from that, this revision is ready for another round.

clang/include/clang/Tooling/Syntax/TokenBuffer.h
120 ↗(On Diff #192661)

Done. Note that we never map to tokens inside macro definition (the s1 s2 example)

191 ↗(On Diff #192661)

Of course a function called "consume" consumes the result :)

Agreed :-)

clang/lib/Tooling/Syntax/TokenBuffer.cpp
295 ↗(On Diff #192456)

Went with BeginExpandedToken, EndExpandedToken.

clang/unittests/Tooling/Syntax/TokenBufferTest.cpp
429 ↗(On Diff #192661)

Thanks. Much nicer with a function that finds by text.

190 ↗(On Diff #192456)

The declarations of Actual and Expected are really close, both types are easy to infer

incomplete, haven't reviewed token collector

clang/include/clang/Tooling/Syntax/TokenBuffer.h
1 ↗(On Diff #192839)

are you sure TokenBuffer is the central concept in this file, rather than just the thing with the most code? Token.h might end up being a better name for users.

8 ↗(On Diff #192839)

file comment?
sometimes they're covered by the class comments, but I think there's a bit to say here.
in particular the logical model (how we model the preprocessor, the two token streams and types of mappings between them) might go here.

32 ↗(On Diff #192839)

what's the default state (and why have one)? Do you need a way to query whether a token is "valid"?
(I'd avoid just relying on length() == 0 or location().isInvalid() because it's not obvious to callers this can happen)

58 ↗(On Diff #192839)

unresonably -> unreasonably

60 ↗(On Diff #192839)

If you're going to say "invocation" in the name, say "use" in the comment (or vice versa!)

74 ↗(On Diff #192839)

It seems weirdly non-orthogonal to be able to get the range but not the file ID.

I'd suggest either adding another accessor for that, or returning a struct BufferRange { FileID File; unsigned Begin, End; } (with a better name.)

I favor the latter, because .first and .second don't offer the reader semantic hints at the callsite, and because that gives a nice place to document the half-open nature of the interval.

Actually, I'd suggest adding such a struct accessor to Token too.

87 ↗(On Diff #192839)

please rename along with macroTokens() function

92 ↗(On Diff #192839)

Warning, braindump follows - let's discuss further.
We talked a bunch offline about the logical and concrete data model here.

As things stand:

  • #includes are not expanded, but will refer to a file ID with its own buffer: map<FileID, TokenBuffer> is the whole structure
  • no information is captured about other PP interactions (PP directives that generate no tokens, skipped sections)
  • the spelled sequence of tokens is not directly available (but expanded + macro invocations may be enough to reconstruct it)

If we keep this model, I think we should spell both of these out in the comment.
But there's another fairly natural model we could consider:

  • we have a single TokenBuffer with a flat list of all expanded tokens in the TU, rather than for the file
  • one FileID corresponds to a contiguous range of *transitively included* tokens; these ranges nest
  • the TokenBuffer also stores the original tokens as map<FileID, vector<Token>>
  • spelled -> expanded token mapping: spelled tokens for a file are partitioned into ranges (types: literal, include, macro invocation, other pp directive, pp skipped region). each range maps onto a range of expanded tokens (empty for e.g. pp directive or skipped region)
  • expanded -> spelled token is similar. (It's almost the inverse of the other mapping; we drop the "include" mapping ranges)
  • This can naturally handle comments, preprocessor directives, pp skipped sections, etc - these are in the spelled stream but not the expanded stream.

e.g. for this code

// foo.cpp
int a;
#include "foo.h"
int b =  FOO(42);
// foo.h
#define FOO(X) X*X
int c;

we'd have this buffer:

expandedTokens = [int a ; int c ; int b = 42 * 42 ;]
spelledTokens = {
  "foo.cpp" => [int a; # include "foo.h" int b = FOO ( 42 ) ;],
  "foo.h" => [# define FOO ( X ) X * X int c ;]
}

expandedToSpelling = {
  int => int (foo.cpp), type = literal
  a => a
  ; => ;

  [] => [# define FOO ( X ) X * X](foo.h), type=preprocessor directive
  int => int (foo.h)
  c => c
  ; => ;

  int => int (foo.cpp)
  b => b
  = => =
  [42 * 42] => [FOO ( 42 ) ](foo.cpp), type=macro invocation
  ; => ; (foo.cpp)
}

spellingToExpanded = {
  // foo.cpp
  int => int, type=literal
  a => a
  ; => ;
  [# include "foo.h"] => [int c ;], type=include
  int => int
  b => b
  = => =
  [FOO ( X )] => [42 * 42], type=macro invocation
  ; => ;

  // foo.h
  [# define FOO ( X ) X * X] => [], type=preprocessor directive
  int => int
  c => c
  ; => ;
}

Various implementation strategies possible here, one obvious one is to use a flat sorted list, and include a sequence of literal tokens as a single entry.

struct ExpandedSpellingMapping {
  unsigned ExpandedStart, ExpandedEnd;
  FileID SpellingFile; // redundant for illustration: can actually derive from SpellingTokens[SpellingStart].location()
  unsigned SpellingStart, SpellingEnd;
  enum { LiteralSequence, MacroInvocation, Preprocessor, PPSkipped, Inclusion } Kind;
}
vector<ExpandedSpellingMapping> ExpandedToSpelling; // bsearchable
vector<pair<FileID, ExpandedSpellingMapping>> Inclusions; // for spelling -> expanded mapping. redundant: much can be taken from SourceManager.

A delta-encoded representation with chunks for bsearch will likely be much more compact, if this proves large.
(Kind of similar to the compressed posting lists in clangd Dex)

But as-is, the mappings for the example file would be:

Expanded = {
  {0, 3, File0, 0, 3, LiteralSequence}, // int a;
  {3, 3, File1, 0, 8, Preprocessor}, // #define FOO(X) X * X
  {3, 6, File1, 8, 11, LiteralSequence}, // int c;
  {6, 9, File0, 6, 9, LiteralSequence}, // int b =
  {9, 12, File0, 9, 13, MacroExpansion}, // FOO(42)
  {12, 13, File0, 13, 14, LiteralSequence}, // ;
}
Inclusions = {
  {File1, {3, 6, File0, 3, 6, Inclusion}}, // #include 
}
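
The flat sorted list above can be exercised with a small std-only sketch. The struct and kind names follow the illustration and are hypothetical, not clang API; the key point is that because the list is sorted, one binary search finds the mapping covering any expanded token index:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Kinds from the illustration above.
enum class MappingKind { LiteralSequence, MacroInvocation, Preprocessor, PPSkipped, Inclusion };

// Simplified ExpandedSpellingMapping: half-open index ranges into the
// expanded and spelled token streams.
struct ExpandedSpellingMapping {
  unsigned ExpandedStart, ExpandedEnd;
  unsigned SpellingStart, SpellingEnd;
  MappingKind Kind;
};

// Find the mapping covering an expanded token index. The list is sorted,
// so the predicate is monotone and a binary search applies; zero-width
// entries (e.g. PP directives) are skipped naturally.
const ExpandedSpellingMapping *
mappingFor(const std::vector<ExpandedSpellingMapping> &Mappings,
           unsigned ExpandedIndex) {
  auto It = std::partition_point(
      Mappings.begin(), Mappings.end(), [&](const ExpandedSpellingMapping &M) {
        return M.ExpandedEnd <= ExpandedIndex;
      });
  if (It == Mappings.end() || ExpandedIndex < It->ExpandedStart)
    return nullptr;
  return &*It;
}
```

Run over the Expanded table above, index 10 (inside FOO's expansion) resolves to the macro entry, while an out-of-range index yields nullptr.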

99 ↗(On Diff #192839)

expandedtokens

104 ↗(On Diff #192839)

(this example is confusing, there is a 10 in the file!)

117 ↗(On Diff #192839)

, preprocessor directives are parsed but not interpreted

If using the model above, this means that the spelled and expanded streams are identical. (Or are we still going to strip comments?)

146 ↗(On Diff #192839)

I'd think this would map a character range onto a character range, or a token range onto a token range - is there some reason otherwise?

171 ↗(On Diff #192839)

not sure quite what this is for, but "tokens not in expanded stream" might include header-guarded #includes, comments, ifdef-'d sections, #defines...
is this API intended to be a whitelist (macro invocations specifically) or a blacklist (everything that's not literally in the expanded stream)

72 ↗(On Diff #192661)

I think your reply applies to TokenBuffer::macroTokens(), not MacroInvocation::macroTokens().

+1 to invocationTokens here.

74 ↗(On Diff #192661)

I'd personally prefer invocationRange for symmetry with invocationTokens (which probably can't just be called tokens). But range is OK.
Please document half-openness (shouldn't be necessary, but this is clang after all).

Will address the rest of the comments later, answering a few questions that popped up first.

clang/include/clang/Tooling/Syntax/TokenBuffer.h
1 ↗(On Diff #192839)

I don't mind changing this to Token.h, although I'd personally expect that a file with this name only contains a definition for the token class.
Tokens.h would be a better fit from my POV. WDYT?

92 ↗(On Diff #192839)

Thanks for raising the multi-file issue earlier rather than later. My original intention was to model TokenBuffer as a stream of expanded tokens for a single FileID, i.e. we would have multiple TokenBuffers for multiple files.

Your suggestion of having a single set of expanded tokens and a map from FileID to vectors of raw tokens (i.e. before preprocessing) looks very nice. A single expanded token stream is definitely a better model for the syntax trees (each node of a tree would cover a continuous subrange of the expanded token stream).

In short, the highlights of the proposed model that I find most notable are:

  • store a single stream of expanded tokens for a TU.
  • store raw tokens for each file (mapped either by FileID or FileEntry).
  • store information about preprocessor directives and macro replacements in each FileID (i.e. MacroInvocation in this patch)
  • support an operation of mapping a subrange of expanded tokens into the subrange of raw-tokens in a particular FileID. This operation might fail.

Let me know if I missed something.

146 ↗(On Diff #192839)

No particular reason; this was driven by its usage in the function computing replacements (not in this patch; see the GitHub prototype if interested, but keep in mind the prototype is not ready for review yet).

Mapping to a range of "raw" tokens seems more consistent with our model; will update accordingly. Obtaining a range is easy once one has the tokens from the file.

171 ↗(On Diff #192839)

This was supposed to be all raw tokens in a file, which are not part of the expanded token stream, i.e. all PP directives, macro-replaced-tokens (object-like and function-like macros), tokens of skipped else branches, etc.

In your proposed model we would store raw tokens of a file separately and this would not be necessary.

72 ↗(On Diff #192661)

Yeah, sorry, I have mixed up these two functions with the same name. Will change to invocationTokens.

sammccall added inline comments.Apr 2 2019, 6:06 AM
clang/include/clang/Tooling/Syntax/TokenBuffer.h
1 ↗(On Diff #192839)

Sure, Tokens.h SGTM.

66 ↗(On Diff #192839)

I'd suggest a more natural order is token -> tokenbuffer -> macroinvocation. Forward declare?

This is because the token buffer exposes token streams, which are probably the second-most important concept in the file (after tokens)

92 ↗(On Diff #192839)

Yes, I think we're on the same page.

Some more details:

store a single stream of expanded tokens for a TU.

BTW I don't think we're giving up the ability to tokenize only a subset of files. If we filter out files, we can just omit their spelled token stream and omit their tokens from the expanded stream. There's complexity if we want to reject a file and accept its includes, but I think the most important case is "tokenize the main file only" where that's not an issue.

store information about preprocessor directives and macro replacements in each FileID

I do think it's useful to treat these as similar things - both for the implementation of mapping between streams, and for users of the API. My hope is that most callsites won't have to consider issues of macros, PP directives, ifdef'd code, etc as separate issues.

support an operation of mapping a subrange of expanded tokens into the subrange of raw-tokens in a particular FileID. This operation might fail

Yes, though I don't have a great intuition for what the API should look like

  • if you select part of an expansion, should the result include the macro invocation, exclude it, or fail?
  • if the expansion range ends at a zero-size expansion (like a PP directive), should it be included or excluded?
  • if the result spans files, is that a failure or do we represent it somehow?

Some of these probably need to be configurable
(We also have the mapping in the other direction, which shouldn't fail)

clang/lib/Tooling/Syntax/TokenBuffer.cpp
42 ↗(On Diff #192839)

this is recognizing keywords, right?
add a comment?

70 ↗(On Diff #192839)

I suspect this state isn't needed if we use a single tokenbuffer for all files

Not sure what the implications of design changes are, so will defer reviewing details of tokencollector (which generally looks good, but is of course highly coupled to lexer/pp) and range mapping (which I suspect could be simplified, but depends heavily on the model).

clang/lib/Tooling/Syntax/TokenBuffer.cpp
61 ↗(On Diff #192839)

I'm afraid this really needs a class comment describing what this is supposed to do (easy!) and the implementation strategy (hard!)

In particular, it looks like macros are handled like this:

  • we expect to see the tokens making up the macro invocation first (... FOO ( 42 ))
  • then we see MacroExpands which allows us to recognize the last N tokens are a macro invocation. We create a MacroInvocation object, and remember where the invocation ends
  • then we see the tokens making up the macro expansion
  • finally, once we see the next token after the invocation, we finalize the MacroInvocation.
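
The four steps above can be mimicked in a toy, std-only state machine (all names hypothetical; the real collector hooks clang's PPCallbacks::MacroExpands and works with real clang tokens, and also sees the expansion tokens in between, which this sketch skips):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// A macro invocation covers a half-open range of spelled tokens.
struct Invocation {
  std::size_t BeginToken, EndToken;
  bool Finalized = false;
};

// Mimics the described flow: invocation tokens arrive first, a
// MacroExpands-like callback tells us the last N of them were an
// invocation, and seeing the next token finalizes it.
class Collector {
  std::vector<std::string> Spelled;
  std::vector<Invocation> Invocations;
  std::size_t PendingEnd = 0;

public:
  // Called for every spelled token, in order.
  void onToken(const std::string &Tok) {
    if (!Invocations.empty() && !Invocations.back().Finalized &&
        Spelled.size() >= PendingEnd)
      Invocations.back().Finalized = true; // token past the invocation
    Spelled.push_back(Tok);
  }

  // Analogue of MacroExpands: the last NumTokens spelled tokens
  // (e.g. FOO ( 42 )) form a macro invocation.
  void onMacroExpands(std::size_t NumTokens) {
    Invocations.push_back({Spelled.size() - NumTokens, Spelled.size()});
    PendingEnd = Spelled.size();
  }

  const std::vector<Invocation> &invocations() const { return Invocations; }
};
```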

285 ↗(On Diff #192839)

is there a problem if we never see another token in that file after the expansion?
(or do we see the eof token in this case?)

ilya-biryukov marked 2 inline comments as done.

Changes:

  • Add multi-file support, record a single expanded stream and per-file-id raw token streams and mappings.
  • Rename MacroInvocation to TokenBuffer::Mapping, make it private.
  • Simplify TokenCollector, let preprocessor handle some more stuff.

TODO:

  • update the docs
  • go through other comments again
  • write more tests
ilya-biryukov marked 19 inline comments as done.
  • Remove spammy debug tracing output.
  • Less debug output, move stuff around, more comments.
  • Add methods that map expanded tokens to raw tokens.
  • Rename toOffsetsRange.
  • Remove default ctor of Token.
  • Introduce a struct for storing FileID and a pair of offsets.
  • Update comments.
  • Add a test for macro expansion at the end of the file.
  • Misc fixes.
  • Fix header comments.

The new version addresses most of the comments; there are just a few left in the code doing the mapping from Dmitri. I'll look into simplifying and removing the possibly redundant checks.
Please take a look at this version, this is very close to the final state.

clang/include/clang/Tooling/Syntax/TokenBuffer.h
8 ↗(On Diff #192839)

Added a file comment explaining the basic concepts inside the file.

32 ↗(On Diff #192839)

Got rid of the default state altogether. I haven't seen use-cases where it's important so far.
Using Optional<Token> for invalid tokens seems like a cleaner design anyway (we did not need it so far).

60 ↗(On Diff #192839)

The class is now internal to TokenBuffer and is called Mapping.

66 ↗(On Diff #192839)

Done. I forgot that you could have members of type vector<T> where T is incomplete.

74 ↗(On Diff #192839)

Done. I've used the name FileRange instead (the idea is that it's pointing to a substring in a file).
Let me know what you think about the name.

87 ↗(On Diff #192839)

This is now BeginRawToken and EndRawToken.
As usual, open to suggestions with respect to naming.

92 ↗(On Diff #192839)

Yes, though I don't have a great intuition for what the API should look like

Agree, and I think there's room for all of these options. For the sake of simplicity in the initial implementation, I would simply pick the set of trade-offs that work for moving macro calls that span full syntax subtrees around and document them.

When the use-cases pop up, we could add more configuration, refactor the APIs, etc.

117 ↗(On Diff #192839)

Removed this constructor altogether, it does not make much sense in the new model.
Instead, tokenize() now returns the raw tokens directly in vector<Token>.

146 ↗(On Diff #192839)

Added a method to map from an expanded token range to a raw token range.
Kept the method that maps an expanded token range to a character range too.

Note that we don't currently have a way to map from a raw token range to an expanded one; that could be added later. We don't need it for the initial use-case (mapping the ranges coming from AST nodes into raw source ranges), so I left it out for now.

72 ↗(On Diff #192661)

The Mapping is now internal and only stores the indices.
The names for the two kinds are "raw" and "expanded"; happy to consider alternatives for both.

clang/lib/Tooling/Syntax/TokenBuffer.cpp
61 ↗(On Diff #192839)

Added a comment. The model is now a bit simpler (more work is done by the preprocessor), but let me know if the comment is still unclear and could be improved.

285 ↗(On Diff #192839)

Yeah, there should always be an eof.
Added a comment and a test for this.

  • Store a source manager in a token buffer
sammccall added inline comments.Apr 16 2019, 7:59 AM
clang/include/clang/Tooling/Syntax/TokenBuffer.h
146 ↗(On Diff #192839)

Note that we don't currently have a way to map from a raw token range to an expanded one; that could be added later. We don't need it for the initial use-case (mapping the ranges coming from AST nodes into raw source ranges), so I left it out for now.

That seems fine, can we add a comment somewhere (class header)?
Not exactly a FIXME, but it would clarify that this is in principle a bidirectional mapping, just missing some implementation.

clang/include/clang/Tooling/Syntax/Tokens.h
49

not needed

60

Token should be copyable I think?

63

sadly, this would need documentation - it's the one-past-the-end location, not the last character in the token, not the location of the "next" token, or the location of the "next" character...

I do wonder whether this is actually the right function to expose...
Do you ever care about end but not start? (The reverse seems likelier). Having two independent source location accessors obscures the invariant that they have the same file ID.

I think exposing a FileRange accessor instead might be better, but for now you could also make callers use Tok.location().getLocWithOffset(Tok.length()) until we know it's the right pattern to encourage

79

(do we need to have two names for this version?)

96

may want to offer length() and StringRef text(const SourceManager&)

112

I think "Spelled" would be a better name than "Raw" here. (Despite being longer and grammatically awkward)

  • The spelled/expanded dichotomy is well-established in clang-land, and this will help understand how they relate to each other and why the difference is important. It's true we'd be extending the meaning a bit (all PP activity, not just macros), but that applies to Expanded too. I think consistency is valuable here.
  • "Raw" regarding tokens has an established meaning (raw lexer mode, raw_identifier etc) that AIUI you're not using here, but plausibly could be. So I think this may cause some confusion.

There's like 100 occurrences of "raw" in this patch, happy to discuss first to avoid churn.

134

It's not clear from this comment what the current behavior *is*, and how it's exceptional.
(It's a FIXME, but these sometimes last a while...)

134

nit: unclear from comment whether this is the only exception or just the only notable one

141

"cannot be determined unambiguously" is a little vague: "doesn't exactly correspond to a sequence of raw tokens"?

161

Not totally clear what the legal arguments are here:

  • any sequence of tokens?
  • a subrange of expandedTokens() (must point into that array) - most obvious, but in this case why does mapping an empty range always fail?
  • any subrange of expandedTokens(), compared by value? - I think this is what the implementation assumes

161

I think "by" is a little weak semantically, I think it just means "the parameter is expanded tokens".

findRawForExpanded() seems clearer.
rawTokensForExpanded() would be clearer still I think.

166

this still seems a bit non-orthogonal:

  • it fails if and only if findRawByExpanded fails, but that isn't visible to callers
  • there's no way to avoid the redundant work of doing the mapping twice
  • the name doesn't indicate that it maps to raw locations as well as translating tokens to locations
  • seems likely to lead to combinatorial explosion, e.g. if we want a pair of expansion locations to a file range, or expansion locations to raw token range...

Can this be a free function taking the *raw* token range instead, to be composed with findRawByExpanded? (Not sure if it belongs in this file or if it's a test helper)

clang/lib/Tooling/Syntax/Tokens.cpp
121

can you explain why it's not? it *almost* is

128

This sounds like an implementation limitation, rather than a desired part of the contract.
maybe e.g.
"If the tokens in the range don't come from the same file, raw token mapping isn't defined.
Because files span contiguous token ranges, if the first/last token have the same file, so do the ones in between."

(nit: no "the")

128

The part that is an implementation limitation, as discussed offline:

#include "bar.h"
int foo;

expands to

int bar;
int foo;

but if you try to map those tokens back to the file, it'll fail due to file ID mismatch between the first and last token.

Whereas the following will map backwards and forwards fine:

int foo1;
#include "bar.h";
int foo2;

So I think requiring the fileIDs of begin/end to match is a bug, and instead we should walk up the #include stack to the closest common file id (while requiring that the range cover the whole #include).

No need to implement this yet, but I think the case is worth documenting.

135

ah, I missed the pointer calculation here. Add an assert at the top of the function that the arrayref is in-range?

150

eeee std::lower_bound...

as discussed offline, can reduce to one call with a helper?
Something like:
pair<const Token*, const Mapping*> MarkedFile::rawForExpanded(const Token&)

This would bsearch to find the relevant mapping.
(If the token isn't part of any mapping, you can find it by index arithmetic anchored on the next mapping/eof - so the second bsearch isn't necessary)

Then you can call this twice for the first/last token (gaining symmetry by converting to an open range). This yields a token range, and the returned mappings can be verified (whole mapping should be covered).
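
A sketch of that helper, under the assumptions that mappings are sorted and that tokens between mappings correspond one-to-one by index arithmetic (hypothetical std-only types, not the real TokenBuffer internals):

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <utility>
#include <vector>

// A mapping links a half-open expanded token range to a half-open
// spelled token range; the list is sorted by BeginExpanded.
struct Mapping {
  unsigned BeginExpanded, EndExpanded;
  unsigned BeginSpelled, EndSpelled;
};

// One bsearch per token: returns the spelled index for an expanded token
// and, if the token lies inside a mapping, that mapping (so a caller
// mapping a range can verify whole mappings are covered).
std::pair<unsigned, const Mapping *>
spelledForExpanded(const std::vector<Mapping> &Mappings, unsigned Expanded) {
  auto Next = std::partition_point(
      Mappings.begin(), Mappings.end(),
      [&](const Mapping &M) { return M.BeginExpanded <= Expanded; });
  if (Next != Mappings.begin()) {
    const Mapping &Prev = *std::prev(Next);
    if (Expanded < Prev.EndExpanded)
      return {Prev.BeginSpelled, &Prev}; // inside a mapping
    // In the gap after Prev: spelled and expanded advance in lockstep.
    return {Prev.EndSpelled + (Expanded - Prev.EndExpanded), nullptr};
  }
  // Before any mapping: indices coincide.
  return {Expanded, nullptr};
}
```

Calling this twice for the first and last token of a range yields the candidate spelled range plus the mappings to validate.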

234

can we add a FIXME for the things that aren't recorded?

  • includes
  • (other) preprocessor directives
  • pp skipped sections
clang/unittests/Tooling/Syntax/TokensTest.cpp
2

A few high level things discussed offline:

  • can we more clearly separate out tests of the token collector (testing the value of the token buffer returned) vs those of the tokenbuffer (testing the tokenbuffer's behavior)
  • the token collector tests might be more clearly/tersely expressed as an exact assertion on a string representation of the tokenbuffer (maybe a special-purpose one)
ilya-biryukov marked 53 inline comments as done.
  • Simplify rawByExpanded by using a helper function.
  • Add a FIXME to add spelled-to-expanded mapping
  • s/raw*/spelled*
  • Split token collector and token buffer tests
  • Rewrite dump function for use in tests
  • Simplify tests by using a dumpForTests function
  • Address other comments
ilya-biryukov added inline comments.
clang/include/clang/Tooling/Syntax/Tokens.h
60

Sure, they are copyable. Am I missing something?

63

Added a comment. With the comment and the implementation being inline and trivial, I don't think anyone would have trouble understanding what this method does.

79

You mean to have distinct names for two different overloads?
I wouldn't do it, they both have fairly similar outputs, could add a small comment that the one with SourceManager should probably be preferred if you have one.

96

SG.
WDYT of exposing the struct members via getters too?

This would mean uniform access for all things (beginOffset, endOffset, length) and adding a ctor that checks invariants (Begin <= End, File.isValid()) would also fit naturally.
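
A minimal sketch of what that could look like, assuming a stand-in FileID so the snippet is self-contained (the real one is clang's):

```cpp
#include <cassert>

// Stand-in for clang's FileID, just enough for the sketch.
struct FileID {
  int ID;
  bool isValid() const { return ID >= 0; }
};

// A half-open character range [Begin, End) in a single file, with the
// invariants checked once in the constructor and uniform getters.
class FileRange {
  FileID File;
  unsigned Begin, End;

public:
  FileRange(FileID File, unsigned Begin, unsigned End)
      : File(File), Begin(Begin), End(End) {
    assert(File.isValid() && "FileRange needs a valid file");
    assert(Begin <= End && "degenerate FileRange");
  }
  FileID file() const { return File; }
  unsigned beginOffset() const { return Begin; }
  unsigned endOffset() const { return End; }
  unsigned length() const { return End - Begin; }
};
```

Checking the invariants in one place means every accessor can rely on them, and callers get the same spelling for all three derived quantities.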

112

Spelled looks fine. The only downside is that it intersects with SourceManager's definition of spelled (which is very similar, but not the same).
Still like it more than raw.

161

Yes, Expanded should be a subrange of expandedTokens(). Added a comment.

166

I've changed this to accept a spelled token sequence, but kept it a member function.
This allows adding an assertion that the raw tokens come from the corresponding raw token stream.

clang/lib/Tooling/Syntax/Tokens.cpp
128

Done. I've added a FIXME; it's not very detailed, though (does not have an example). Let me know if you feel it's necessary to add one.

ilya-biryukov added inline comments.Apr 24 2019, 8:38 AM
clang/lib/Tooling/Syntax/Tokens.cpp
135

We now have the corresponding assertions in a helper function.

  • s/llvm::unittest::ValueIs/llvm::ValueIs.
  • Add tests for empty macro expansions and directives around macro expansions.
  • Record all gaps between spelled and expanded tokens.
  • Tweak test dump to distinguish different tokens with the same spelling.

@sammccall, could we do another round of review? I think it's perfect now...
(Just kidding :-) )

  • Update a comment
sammccall accepted this revision.Apr 29 2019, 12:54 AM
sammccall marked an inline comment as done.

Rest is details only.

clang/include/clang/Tooling/Syntax/Tokens.h
60

No. I think I got confused by the explicit clang::Token constructor.

63

(this is ok. I do think a FileRange accessor would make client code more readable/less error-prone. Let's discuss offline)

76

there is no "other overload"

79

No sorry, I meant do we need both str() and operator<<

96

that sounds fine to me, though I don't feel strongly

clang/lib/Tooling/Syntax/Tokens.cpp
78

Nit: I'd suggest moving this all the way to the bottom or something? It's pretty big and seems a bit "in the way" here.

95

why not just take a ref?

101

you might want to move the token -> fileID calculation to a helper function (or method on token) called by findRawByExpanded, and then put this function onto MarkedFile.

Reason is, computing the common #include ancestor (and therefore the file buffer to use) can't live in this function, which only sees one of the endpoints.

But this can be deferred until that case is handled...

109

(renaming ExpandedIndex -> L seems confusing, use same name or just capture?)

131

As predicted :-) I think these _<index> suffixes are a maintenance hazard.

In practice, the assertions are likely to add them by copy-pasting the test output.

They guard against a class of error that doesn't seem very likely, and in practice they don't even really prevent it (because of the copy/paste issue).

I'd suggest just dropping them, and living with the test assertions being slightly ambiguous.

Failing that, some slightly trickier formatting could give more context:

A B C D E F --> A B ... E F

With special cases for smaller numbers of tokens. I don't like the irregularity of that approach though.

157

if you prefer, this is

auto It = llvm::bsearch(File.Mappings,
                        [&](const Mapping &M) { return M.BeginExpanded >= Expanded; });

I *think* this is clear enough that you can drop the comment, though I may be biased.

172

nit: "include root" or "include ancestor"?

clang/unittests/Tooling/Syntax/TokensTest.cpp
623

please fix :-)

This revision is now accepted and ready to land.Apr 29 2019, 12:54 AM
  • Simplify collection of tokens
  • Move dumping code to the bottom of the file
ilya-biryukov marked 2 inline comments as done.Apr 30 2019, 2:24 AM
  • Get only expanded stream from the preprocessor, recompute mappings and spelled stream separately
sammccall added inline comments.May 7 2019, 7:45 AM
clang/lib/Tooling/Syntax/Tokens.cpp
131

(this one is still open)

217

what's "$[collect-tokens]"?
Maybe just "Token: "?

225

this function is now >100 lines long. Can we split it up?

226

this could be a member, which would help splitting up

227

Is there a reason we do this at the end instead of as tokens are collected?

244

consumeSpelledUntil and fillGapUntil could be methods, I think

278

the body here could be a method too, I think
i.e. for each (expanded token), process it?

323

and similarly this section

clang/unittests/Tooling/Syntax/TokensTest.cpp
623

(still missing?)

ilya-biryukov added inline comments.May 7 2019, 8:36 AM
clang/lib/Tooling/Syntax/Tokens.cpp
131

Will make sure to do something about it before submitting

clang/unittests/Tooling/Syntax/TokensTest.cpp
623

Will make sure to land this before submitting.

  • Move the construction logic to a separate class, split into multiple methods.
ilya-biryukov marked 6 inline comments as done.
  • Move filling the gaps at the end of files to a separate function
ilya-biryukov added inline comments.May 9 2019, 9:52 AM
clang/lib/Tooling/Syntax/Tokens.cpp
131

As mentioned in the offline conversations, we both agree that we don't want to have ambiguous test-cases.
The proposed alternative was putting the counters of tokens of the same kind in the same token stream, with the goal of making updating test-cases simpler (now one would need to update only the indices of the tokens they changed).

After playing around with the idea, I think this complicates the dumping function too much. The easiest way to update the test is to run it and copy-paste the expected output.
So I'd keep it as is and avoid the extra complexity, if that's ok.

227

No strong reason, just wanted to keep the preprocessor callback minimal

ilya-biryukov marked 2 inline comments as done.May 9 2019, 9:53 AM
ilya-biryukov marked 5 inline comments as done.
  • Check invariants on FileRange construction, unify access to all fields
ilya-biryukov marked 3 inline comments as done.May 9 2019, 10:15 AM
ilya-biryukov added inline comments.
clang/include/clang/Tooling/Syntax/Tokens.h
79

one can type str() in a debugger; operator<< is for convenience when one is already using streams.

ilya-biryukov marked 2 inline comments as done.May 9 2019, 10:17 AM
ilya-biryukov added inline comments.
clang/include/clang/Tooling/Syntax/Tokens.h
63

I've added a corresponding accessor (and a version of it that accepts a range of tokens) to D61681.
I'd keep it off this patch for now, will refactor in a follow-up.

ilya-biryukov marked 2 inline comments as done.May 9 2019, 10:18 AM
ilya-biryukov marked 2 inline comments as done.
  • Use bsearch instead of upper_bound
  • Remove the function that maps tokens to offsets
  • Skip annotation tokens; some of them are reported by the preprocessor, but we don't want them in the final expanded stream.
  • Add functions to compute FileRange of tokens, add tests for it.
sammccall accepted this revision.May 17 2019, 7:34 AM
sammccall added inline comments.
clang/include/clang/Tooling/Syntax/Tokens.h
50

nit: character range (just to be totally explicit)?

116

I think this might need a more explicit name. It's reasonably obvious that this needs to be optional for some cases (token pasting), but it's not obvious at callsites that (or why) we don't use the spelling location for macro-expanded tokens.

One option would be just do that and add an expandedFromMacro() helper so the no-macros version is easy to express too.
If we can't do that, maybe directlySpelledRange or something?

122

similar to above, I'd naively expect this to return a valid range, given the tokens expanded from assert(X && [[Y.z()]] )

ilya-biryukov added inline comments.May 17 2019, 9:38 AM
clang/include/clang/Tooling/Syntax/Tokens.h
116

As mentioned in the offline conversation, the idea is that mapping from a token inside a macro expansion to a spelled token should be handled by TokenBuffer, and these two functions are really just convenience helpers to get to a range *after* the mapping.

This change has been boiling for some time and I think the other bits of it seem to be non-controversial and agreed upon.
Would it be ok if we land it with this function using a more concrete name (directlySpelledRange or something else) or move it into a separate change?

There's a follow-up change that adds an 'expand macro' action to clangd, which both has a use-site for these method and provides a functional feature based on TokenBuffer. Iterating on the design with (even a single) use-case should be simpler.

sammccall added inline comments.May 17 2019, 9:59 AM
clang/include/clang/Tooling/Syntax/Tokens.h
116

If we do want to reflect expanded/spelled as types, this will rapidly get harder to change. But we do need to make progress here.

If passing spelled tokens is the intended/well-understood use, let's just assert on that and not return Optional. Then I'm less worried about the name: misuse will be reliably punished.

ilya-biryukov marked 3 inline comments as done.
  • Fix a comment, reformat
  • Use assertions instead of returning optionals
clang/include/clang/Tooling/Syntax/Tokens.h
116

Added an assertion. Not sure about the name, kept range for now for the lack of a better alternative.

This revision was automatically updated to reflect the committed changes.
thakis added a subscriber: thakis.May 20 2019, 12:17 PM

Out of interest: The RecursiveASTVisitorTests are part of the ToolingTests binary while this adds a new binary TokensTest. Can you say why?

(No change needed, just curious.)

Another comment: The new binary is called TokensTest but is in a directory "Syntax". For consistency with all other unit test binaries, please either rename the binary to SyntaxTests, or rename the directory to "Tokens". (From the patch description, the former seems more appropriate.) Note the missing trailing "s" in the binary name too.

Out of interest: The RecursiveASTVisitorTests are part of the ToolingTests binary while this adds a new binary TokensTest. Can you say why?

(No change needed, just curious.)

The syntax library is mostly independent of the rest of tooling, so I'd rather keep everything related to it separate, including the tests.
I don't think we'll get anything in terms of code reuse from merging them into the same test binary either.

Another comment: The new binary is called TokensTest but is in a directory "Syntax". For consistency with all other unit test binaries, please either rename the binary to SyntaxTests, or rename the directory to "Tokens". (From the patch description, the former seems more appropriate.) Note the missing trailing "s" in the binary name too.

Agree with renaming the binary. In fact, the not-yet-landed revisions also use SyntaxTests, and I should've named it this way from the start. I'll land a patch.

nridge added a subscriber: nridge.Jan 9 2020, 3:24 PM
nridge added inline comments.
include/clang/Tooling/Syntax/Tokens.h
206 ↗(On Diff #200260)

Is this comment correct?

If so:

  • Why are the tokens "int", "name", "=", "10" not included?
  • Why are the tokens "DECL", "(", "a", ")" not included?
ilya-biryukov marked 2 inline comments as done.Jan 9 2020, 11:07 PM
ilya-biryukov added inline comments.
include/clang/Tooling/Syntax/Tokens.h
206 ↗(On Diff #200260)

It isn't. Thanks for catching that.
This patch went through so many revisions, it was close to impossible to track this down.

ilya-biryukov marked 2 inline comments as done.Jan 9 2020, 11:17 PM
ilya-biryukov added inline comments.
include/clang/Tooling/Syntax/Tokens.h
206 ↗(On Diff #200260)

Also:

  • we do not store 'eof' in the spelled tokens anymore
  • the FIXME is stale, we do store tokens of macro directives now

759c90456d418ffe69e1a2b4bcea2792491a6b5a updates the comment.