This is an archive of the discontinued LLVM Phabricator instance.

[Syntax] Introduce TokenBuffer, start clangToolingSyntax library
ClosedPublic

Authored by ilya-biryukov on Mar 27 2019, 9:21 AM.

Details

Summary

TokenBuffer stores the list of tokens for a file obtained after
preprocessing. This is a base building block for syntax trees,
see [1] for the full proposal on syntax trees.

This commits also starts a new sub-library of ClangTooling, which
would be the home for the syntax trees and syntax-tree-based refactoring
utilities.

[1]: https://lists.llvm.org/pipermail/cfe-dev/2019-February/061414.html

Event Timeline

There are a very large number of changes, so older changes are hidden.

Still have a few comments to address in TokenCollector and with respect to naming. But apart from that, this revision is ready for another round.

clang/include/clang/Tooling/Syntax/TokenBuffer.h
120 ↗(On Diff #192661)

Done. Note that we never map to tokens inside macro definition (the s1 s2 example)

191 ↗(On Diff #192661)

Of course a function called "consume" consumes the result :)

Agreed :-)

clang/lib/Tooling/Syntax/TokenBuffer.cpp
295 ↗(On Diff #192456)

Went with BeginExpandedToken, EndExpandedToken.

clang/unittests/Tooling/Syntax/TokenBufferTest.cpp
429 ↗(On Diff #192661)

Thanks. Much nicer with a function that finds by text.

190 ↗(On Diff #192456)

The declarations of Actual and Expected are really close, both types are easy to infer

incomplete, haven't reviewed token collector

clang/include/clang/Tooling/Syntax/TokenBuffer.h
1 ↗(On Diff #192839)

are you sure TokenBuffer is the central concept in this file, rather than just the thing with the most code? Token.h might end up being a better name for users.

8 ↗(On Diff #192839)

file comment?
sometimes they're covered by the class comments, but I think there's a bit to say here.
in particular the logical model (how we model the preprocessor, the two token streams and types of mappings between them) might go here.

32 ↗(On Diff #192839)

what's the default state (and why have one)? Do you need a way to query whether a token is "valid"?
(I'd avoid just relying on length() == 0 or location().isInvalid() because it's not obvious to callers this can happen)

58 ↗(On Diff #192839)

unresonably -> unreasonably

60 ↗(On Diff #192839)

If you're going to say "invocation" in the name, say "use" in the comment (or vice versa!)

74 ↗(On Diff #192839)

It seems weirdly non-orthogonal to be able to get the range but not the file ID.

I'd suggest either adding another accessor for that, or returning a struct BufferRange { FileID File; unsigned Begin, End; } (with a better name.)

I favor the latter, because .first and .second don't offer the reader semantic hints at the callsite, and because that gives a nice place to document the half-open nature of the interval.

Actually, I'd suggest adding such a struct accessor to Token too.

87 ↗(On Diff #192839)

please rename along with macroTokens() function

92 ↗(On Diff #192839)

Warning, braindump follows - let's discuss further.
We talked a bunch offline about the logical and concrete data model here.

As things stand:

  • #includes are not expanded, but will refer to a file ID with its own buffer: map<FileID, TokenBuffer> is the whole structure
  • no information is captured about other PP interactions (PP directives that generate no tokens, skipped sections)
  • the spelled sequence of tokens is not directly available (but expanded + macro invocations may be enough to reconstruct it)

If we keep this model, I think we should spell both of these out in the comment.
But there's another fairly natural model we could consider:

  • we have a single TokenBuffer with a flat list of all expanded tokens in the TU, rather than for the file
  • one FileID corresponds to a contiguous range of *transitively included* tokens; these ranges nest
  • the TokenBuffer also stores the original tokens as map<FileID, vector<Token>>
  • spelled -> expanded token mapping: spelled tokens for a file are partitioned into ranges (types: literal, include, macro invocation, other pp directive, pp skipped region). each range maps onto a range of expanded tokens (empty for e.g. pp directive or skipped region)
  • expanded -> spelled token is similar. (It's almost the inverse of the other mapping; we drop the "include" mapping ranges)
  • This can naturally handle comments, preprocessor directives, pp skipped sections, etc - these are in the spelled stream but not the expanded stream.

e.g. for this code

// foo.cpp
int a;
#include "foo.h"
int b =  FOO(42);
// foo.h
#define FOO(X) X*X
int c;

we'd have this buffer:

expandedTokens = [int a ; int c ; int b = 42 * 42 ;]
spelledTokens = {
  "foo.cpp" => [int a; # include "foo.h" int b = FOO ( 42 ) ;],
  "foo.h" => [# define FOO ( X ) X * X int c ;]
}

expandedToSpelling = {
  int => int (foo.cpp), type = literal
  a => a
  ; => ;

  [] => [# define FOO ( X ) X * X](foo.h), type=preprocessor directive
  int => int (foo.h)
  c => c
  ; => ;

  int => int (foo.cpp)
  b => b
  = => =
  [42 * 42] => [FOO ( 42 ) ](foo.cpp), type=macro invocation
  ; => ; (foo.cpp)
}

spellingToExpanded = {
  // foo.cpp
  int => int, type=literal
  a => a
  ; => ;
  [# include "foo.h"] => [int c ;], type=include
  int => int
  b => b
  = => =
  [FOO ( X )] => [42 * 42], type=macro invocation
  ; => ;

  // foo.h
  [# define FOO ( X ) X * X] => [], type=preprocessor directive
  int => int
  c => c
  ; => ;
}

Various implementation strategies possible here, one obvious one is to use a flat sorted list, and include a sequence of literal tokens as a single entry.

struct ExpandedSpellingMapping {
  unsigned ExpandedStart, ExpandedEnd;
  FileID SpellingFile; // redundant for illustration: can actually derive from SpellingTokens[SpellingStart].location()
  unsigned SpellingStart, SpellingEnd;
  enum { LiteralSequence, MacroInvocation, Preprocessor, PPSkipped, Inclusion } Kind;
}
vector<ExpandedSpellingMapping> ExpandedToSpelling; // bsearchable
vector<pair<FileID, ExpandedSpellingMapping>> Inclusions; // for spelling -> expanded mapping. redundant: much can be taken from SourceManager.

A delta-encoded representation with chunks for bsearch will likely be much more compact, if this proves large.
(Kind of similar to the compressed posting lists in clangd Dex)

But as-is, the mappings for the example file would be:

Expanded = {
  {0, 3, File0, 0, 3, LiteralSequence}, // int a;
  {3, 3, File1, 0, 8, Preprocessor}, // #define FOO(X) X * X
  {3, 6, File1, 8, 11, LiteralSequence}, // int c;
  {6, 9, File0, 6, 9, LiteralSequence}, // int b =
  {9, 12, File0, 9, 13, MacroExpansion}, // FOO(42)
  {12, 13, File0, 13, 14, LiteralSequence}, // ;
}
Inclusions = {
  {File1, {3, 6, File0, 3, 6, Inclusion}}, // #include 
}
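
The flat sorted list above can be exercised with a small std-only sketch. The struct and kind names follow the illustration and are hypothetical, not clang API; the key point is that because the list is sorted, one binary search finds the mapping covering any expanded token index:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Kinds from the illustration above.
enum class MappingKind { LiteralSequence, MacroInvocation, Preprocessor, PPSkipped, Inclusion };

// Simplified ExpandedSpellingMapping: half-open index ranges into the
// expanded and spelled token streams.
struct ExpandedSpellingMapping {
  unsigned ExpandedStart, ExpandedEnd;
  unsigned SpellingStart, SpellingEnd;
  MappingKind Kind;
};

// Find the mapping covering an expanded token index. The list is sorted,
// so the predicate is monotone and a binary search applies; zero-width
// entries (e.g. PP directives) are skipped naturally.
const ExpandedSpellingMapping *
mappingFor(const std::vector<ExpandedSpellingMapping> &Mappings,
           unsigned ExpandedIndex) {
  auto It = std::partition_point(
      Mappings.begin(), Mappings.end(), [&](const ExpandedSpellingMapping &M) {
        return M.ExpandedEnd <= ExpandedIndex;
      });
  if (It == Mappings.end() || ExpandedIndex < It->ExpandedStart)
    return nullptr;
  return &*It;
}
```

Run over the Expanded table above, index 10 (inside FOO's expansion) resolves to the macro entry, while an out-of-range index yields nullptr.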

99 ↗(On Diff #192839)

expandedtokens

104 ↗(On Diff #192839)

(this example is confusing, there is a 10 in the file!)

117 ↗(On Diff #192839)

, preprocessor directives are parsed but not interpreted

If using the model above, this means that the spelled and expanded streams are identical. (Or are we still going to strip comments?)

146 ↗(On Diff #192839)

I'd think this would map a character range onto a character range, or a token range onto a token range - is there some reason otherwise?

171 ↗(On Diff #192839)

not sure quite what this is for, but "tokens not in expanded stream" might include header-guarded #includes, comments, ifdef-'d sections, #defines...
is this API intended to be a whitelist (macro invocations specifically) or a blacklist (everything that's not literally in the expanded stream)

72 ↗(On Diff #192661)

I think your reply applies to TokenBuffer::macroTokens(), not MacroInvocation::macroTokens().

+1 to invocationTokens here.

74 ↗(On Diff #192661)

I'd personally prefer invocationRange for symmetry with invocationTokens (which probably can't just be called tokens). But range is OK.
Please document half-openness (shouldn't be necessary, but this is clang after all).

Will address the rest of the comments later, answering a few questions that popped up first.

clang/include/clang/Tooling/Syntax/TokenBuffer.h
1 ↗(On Diff #192839)

I don't mind changing this to Token.h, although I'd personally expect that a file with this name only contains a definition for the token class.
Tokens.h would be a better fit from my POV. WDYT?

92 ↗(On Diff #192839)

Thanks for raising the multi-file issue earlier rather than later. My original intention was to model TokenBuffer as a stream of expanded tokens for a single FileID, i.e. we would have multiple TokenBuffers for multiple files.

Your suggestion of having a single set of expanded tokens and a map from FileID to vectors of raw tokens (i.e. before preprocessing) looks very nice. A single expanded token stream is definitely a better model for the syntax trees (each node of a tree would cover a continuous subrange of the expanded token stream).

In short, the highlights of the proposed model that I find most notable are:

  • store a single stream of expanded tokens for a TU.
  • store raw tokens for each file (mapped either by FileID or FileEntry).
  • store information about preprocessor directives and macro replacements in each FileID (i.e. MacroInvocation in this patch)
  • support an operation of mapping a subrange of expanded tokens into the subrange of raw-tokens in a particular FileID. This operation might fail.

Let me know if I missed something.

146 ↗(On Diff #192839)

No particular reason; this was driven by its usage in the function computing replacements (not in this patch; see the GitHub prototype if interested, but keep in mind the prototype is not ready for review yet).

Mapping to a range of "raw" tokens seems more consistent with our model; will update accordingly. Obtaining a range is easy once one has the tokens from the file.

171 ↗(On Diff #192839)

This was supposed to be all raw tokens in a file, which are not part of the expanded token stream, i.e. all PP directives, macro-replaced-tokens (object-like and function-like macros), tokens of skipped else branches, etc.

In your proposed model we would store raw tokens of a file separately and this would not be necessary.

72 ↗(On Diff #192661)

Yeah, sorry, I have mixed up these two functions with the same name. Will change to invocationTokens.

sammccall added inline comments.Apr 2 2019, 6:06 AM
clang/include/clang/Tooling/Syntax/TokenBuffer.h
1 ↗(On Diff #192839)

Sure, Tokens.h SGTM.

66 ↗(On Diff #192839)

I'd suggest a more natural order is token -> tokenbuffer -> macroinvocation. Forward declare?

This is because the token buffer exposes token streams, which are probably the second-most important concept in the file (after tokens)

92 ↗(On Diff #192839)

Yes, I think we're on the same page.

Some more details:

store a single stream of expanded tokens for a TU.

BTW I don't think we're giving up the ability to tokenize only a subset of files. If we filter out files, we can just omit their spelled token stream and omit their tokens from the expanded stream. There's complexity if we want to reject a file and accept its includes, but I think the most important case is "tokenize the main file only" where that's not an issue.

store information about preprocessor directives and macro replacements in each FileID

I do think it's useful to treat these as similar things - both for the implementation of mapping between streams, and for users of the API. My hope is that most callsites won't have to consider issues of macros, PP directives, ifdef'd code, etc as separate issues.

support an operation of mapping a subrange of expanded tokens into the subrange of raw-tokens in a particular FileID. This operation might fail

Yes, though I don't have a great intuition for what the API should look like

  • if you select part of an expansion, should the result include the macro invocation, exclude it, or fail?
  • if the expansion range ends at a zero-size expansion (like a PP directive), should it be included or excluded?
  • if the result spans files, is that a failure or do we represent it somehow?

Some of these probably need to be configurable
(We also have the mapping in the other direction, which shouldn't fail)

clang/lib/Tooling/Syntax/TokenBuffer.cpp
42 ↗(On Diff #192839)

this is recognizing keywords, right?
add a comment?

70 ↗(On Diff #192839)

I suspect this state isn't needed if we use a single tokenbuffer for all files

Not sure what the implications of design changes are, so will defer reviewing details of tokencollector (which generally looks good, but is of course highly coupled to lexer/pp) and range mapping (which I suspect could be simplified, but depends heavily on the model).

clang/lib/Tooling/Syntax/TokenBuffer.cpp
61 ↗(On Diff #192839)

I'm afraid this really needs a class comment describing what this is supposed to do (easy!) and the implementation strategy (hard!)

In particular, it looks like macros are handled like this:

  • we expect to see the tokens making up the macro invocation first (... FOO ( 42 ))
  • then we see MacroExpands which allows us to recognize the last N tokens are a macro invocation. We create a MacroInvocation object, and remember where the invocation ends
  • then we see the tokens making up the macro expansion
  • finally, once we see the next token after the invocation, we finalize the MacroInvocation.
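
The four steps above can be mimicked in a toy, std-only state machine (all names hypothetical; the real collector hooks clang's PPCallbacks::MacroExpands and works with real clang tokens, and also sees the expansion tokens in between, which this sketch skips):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// A macro invocation covers a half-open range of spelled tokens.
struct Invocation {
  std::size_t BeginToken, EndToken;
  bool Finalized = false;
};

// Mimics the described flow: invocation tokens arrive first, a
// MacroExpands-like callback tells us the last N of them were an
// invocation, and seeing the next token finalizes it.
class Collector {
  std::vector<std::string> Spelled;
  std::vector<Invocation> Invocations;
  std::size_t PendingEnd = 0;

public:
  // Called for every spelled token, in order.
  void onToken(const std::string &Tok) {
    if (!Invocations.empty() && !Invocations.back().Finalized &&
        Spelled.size() >= PendingEnd)
      Invocations.back().Finalized = true; // token past the invocation
    Spelled.push_back(Tok);
  }

  // Analogue of MacroExpands: the last NumTokens spelled tokens
  // (e.g. FOO ( 42 )) form a macro invocation.
  void onMacroExpands(std::size_t NumTokens) {
    Invocations.push_back({Spelled.size() - NumTokens, Spelled.size()});
    PendingEnd = Spelled.size();
  }

  const std::vector<Invocation> &invocations() const { return Invocations; }
};
```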

285 ↗(On Diff #192839)

is there a problem if we never see another token in that file after the expansion?
(or do we see the eof token in this case?)

ilya-biryukov marked 2 inline comments as done.

Changes:

  • Add multi-file support, record a single expanded stream and per-file-id raw token streams and mappings.
  • Rename MacroInvocation to TokenBuffer::Mapping, make it private.
  • Simplify TokenCollector, let preprocessor handle some more stuff.

TODO:

  • update the docs
  • go through other comments again
  • write more tests
ilya-biryukov marked 19 inline comments as done.
  • Remove spammy debug tracing output.
  • Less debug output, move stuff around, more comments.
  • Add methods that map expanded tokens to raw tokens.
  • Rename toOffsetsRange.
  • Remove default ctor of Token.
  • Introduce a struct for storing FileID and a pair of offsets.
  • Update comments.
  • Add a test for macro expansion at the end of the file.
  • Misc fixes.
  • Fix header comments.

The new version addresses most of the comments; there are just a few left in the code doing the mapping from Dmitri. I'll look into simplifying and removing the possibly redundant checks.
Please take a look at this version, this is very close to the final state.

clang/include/clang/Tooling/Syntax/TokenBuffer.h
8 ↗(On Diff #192839)

Added a file comment explaining the basic concepts inside the file.

32 ↗(On Diff #192839)

Got rid of the default state altogether. I haven't seen use-cases where it's important so far.
Using Optional<Token> for invalid tokens seems like a cleaner design anyway (we did not need it so far).

60 ↗(On Diff #192839)

The class is now internal to TokenBuffer and is called Mapping.

66 ↗(On Diff #192839)

Done. I forgot that you could have members of type vector<T> where T is incomplete.

74 ↗(On Diff #192839)

Done. I've used the name FileRange instead (the idea is that it's pointing to a substring in a file).
Let me know what you think about the name.

87 ↗(On Diff #192839)

This is now BeginRawToken and EndRawToken.
As usual, open to suggestions with respect to naming.

92 ↗(On Diff #192839)

Yes, though I don't have a great intuition for what the API should look like

Agree, and I think there's room for all of these options. For the sake of simplicity in the initial implementation, I would simply pick the set of trade-offs that work for moving macro calls that span full syntax subtrees around and document them.

When the use-cases pop up, we could add more configuration, refactor the APIs, etc.

117 ↗(On Diff #192839)

Removed this constructor altogether, it does not make much sense in the new model.
Instead, tokenize() now returns the raw tokens directly in vector<Token>.

146 ↗(On Diff #192839)

Added a method to map from an expanded token range to a raw token range.
Kept the method that maps an expanded token range to a character range too.

Note that we don't currently have a way to map from a raw token range to an expanded one; that could be added later. We don't need it for the initial use-case (mapping the ranges coming from AST nodes into raw source ranges), so I left it out for now.

72 ↗(On Diff #192661)

The Mapping is now internal and only stores the indices.
The names for the two kinds are "raw" and "expanded"; happy to consider alternatives for both.

clang/lib/Tooling/Syntax/TokenBuffer.cpp
61 ↗(On Diff #192839)

Added a comment. The model is now a bit simpler (more work is done by the preprocessor), but let me know if the comment is still unclear and could be improved.

285 ↗(On Diff #192839)

Yeah, there should always be an eof.
Added a comment and a test for this.

  • Store a source manager in a token buffer
sammccall added inline comments.Apr 16 2019, 7:59 AM
clang/include/clang/Tooling/Syntax/TokenBuffer.h
146 ↗(On Diff #192839)

Note that we don't currently have a way to map from a raw token range to an expanded one; that could be added later. We don't need it for the initial use-case (mapping the ranges coming from AST nodes into raw source ranges), so I left it out for now.

That seems fine, can we add a comment somewhere (class header)?
Not exactly a FIXME, but it would clarify that this is in principle a bidirectional mapping, just missing some implementation.

clang/include/clang/Tooling/Syntax/Tokens.h
49

not needed

60

Token should be copyable I think?

63

sadly, this would need documentation - it's the one-past-the-end location, not the last character in the token, not the location of the "next" token, or the location of the "next" character...

I do wonder whether this is actually the right function to expose...
Do you ever care about end but not start? (The reverse seems likelier). Having two independent source location accessors obscures the invariant that they have the same file ID.

I think exposing a FileRange accessor instead might be better, but for now you could also make callers use Tok.location().getLocWithOffset(Tok.length()) until we know it's the right pattern to encourage

79

(do we need to have two names for this version?)

96

may want to offer length() and StringRef text(const SourceManager&)

112

I think "Spelled" would be a better name than "Raw" here. (Despite being longer and grammatically awkward)

  • The spelled/expanded dichotomy is well-established in clang-land, and this will help understand how they relate to each other and why the difference is important. It's true we'd be extending the meaning a bit (all PP activity, not just macros), but that applies to Expanded too. I think consistency is valuable here.
  • "Raw" regarding tokens has an established meaning (raw lexer mode, raw_identifier etc) that AIUI you're not using here, but plausibly could be. So I think this may cause some confusion.

There's like 100 occurrences of "raw" in this patch, happy to discuss first to avoid churn.

134

It's not clear from this comment what the current behavior *is*, and how it's exceptional.
(It's a FIXME, but these sometimes last a while...)

134

nit: unclear from comment whether this is the only exception or just the only notable one

141

"cannot be determined unambiguously" is a little vague: "doesn't exactly correspond to a sequence of raw tokens"?

161

Not totally clear what the legal arguments are here:

  • any sequence of tokens?
  • a subrange of expandedTokens() (must point into that array) - most obvious, but in this case why does mapping an empty range always fail?
  • any subrange of expandedTokens(), compared by value? - I think this is what the implementation assumes

161

I think "by" is a little weak semantically, I think it just means "the parameter is expanded tokens".

findRawForExpanded() seems clearer.
rawTokensForExpanded() would be clearer still I think.

166

this still seems a bit non-orthogonal:

  • it fails if and only if findRawByExpanded fails, but that isn't visible to callers
  • there's no way to avoid the redundant work of doing the mapping twice
  • the name doesn't indicate that it maps to raw locations as well as translating tokens to locations
  • seems likely to lead to combinatorial explosion, e.g. if we want a pair of expansion locations to a file range, or expansion locations to raw token range...

Can this be a free function taking the *raw* token range instead, to be composed with findRawByExpanded? (Not sure if it belongs in this file or if it's a test helper)

clang/lib/Tooling/Syntax/Tokens.cpp
121

can you explain why it's not? it *almost* is

128

This sounds like an implementation limitation, rather than a desired part of the contract.
maybe e.g.
"If the tokens in the range don't come from the same file, raw token mapping isn't defined.
Because files span contiguous token ranges, if the first/last token have the same file, so do the ones in between."

(nit: no "the")

128

The part that is an implementation limitation, as discussed offline:

#include "bar.h"
int foo;

expands to

int bar;
int foo;

but if you try to map those tokens back to the file, it'll fail due to file ID mismatch between the first and last token.

Whereas the following will map backwards and forwards fine:

int foo1;
#include "bar.h";
int foo2;

So I think requiring the fileIDs of begin/end to match is a bug, and instead we should walk up the #include stack to the closest common file id (while requiring that the range cover the whole #include).

No need to implement this yet, but I think the case is worth documenting.

135

ah, I missed the pointer calculation here. Add an assert at the top of the function that the arrayref is in-range?

150

eeee std::lower_bound...

as discussed offline, can reduce to one call with a helper?
Something like:
pair<const Token*, const Mapping*> MarkedFile::rawForExpanded(const Token&)

This would bsearch to find the relevant mapping.
(If the token isn't part of any mapping, you can find it by index arithmetic anchored on the next mapping/eof - so the second bsearch isn't necessary)

Then you can call this twice for the first/last token (gaining symmetry by converting to an open range). This yields a token range, and the returned mappings can be verified (whole mapping should be covered).
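
A sketch of that helper, under the assumptions that mappings are sorted and that tokens between mappings correspond one-to-one by index arithmetic (hypothetical std-only types, not the real TokenBuffer internals):

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <utility>
#include <vector>

// A mapping links a half-open expanded token range to a half-open
// spelled token range; the list is sorted by BeginExpanded.
struct Mapping {
  unsigned BeginExpanded, EndExpanded;
  unsigned BeginSpelled, EndSpelled;
};

// One bsearch per token: returns the spelled index for an expanded token
// and, if the token lies inside a mapping, that mapping (so a caller
// mapping a range can verify whole mappings are covered).
std::pair<unsigned, const Mapping *>
spelledForExpanded(const std::vector<Mapping> &Mappings, unsigned Expanded) {
  auto Next = std::partition_point(
      Mappings.begin(), Mappings.end(),
      [&](const Mapping &M) { return M.BeginExpanded <= Expanded; });
  if (Next != Mappings.begin()) {
    const Mapping &Prev = *std::prev(Next);
    if (Expanded < Prev.EndExpanded)
      return {Prev.BeginSpelled, &Prev}; // inside a mapping
    // In the gap after Prev: spelled and expanded advance in lockstep.
    return {Prev.EndSpelled + (Expanded - Prev.EndExpanded), nullptr};
  }
  // Before any mapping: indices coincide.
  return {Expanded, nullptr};
}
```

Calling this twice for the first and last token of a range yields the candidate spelled range plus the mappings to validate.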

234

can we add a FIXME for the things that aren't recorded?

  • includes
  • (other) preprocessor directives
  • pp skipped sections
clang/unittests/Tooling/Syntax/TokensTest.cpp
2

A few high level things discussed offline:

  • can we more clearly separate out tests of the token collector (testing the value of the token buffer returned) vs those of the tokenbuffer (testing the tokenbuffer's behavior)
  • the token collector tests might be more clearly/tersely expressed as an exact assertion on a string representation of the tokenbuffer (maybe a special-purpose one)
ilya-biryukov marked 53 inline comments as done.
  • Simplify rawByExpanded by using a helper function.
  • Add a FIXME to add spelled-to-expanded mapping
  • s/raw*/spelled*
  • Split token collector and token buffer tests
  • Rewrite dump function for use in tests
  • Simplify tests by using a dumpForTests function
  • Address other comments
ilya-biryukov added inline comments.
clang/include/clang/Tooling/Syntax/Tokens.h
60

Sure, they are copyable. Am I missing something?

63

Added a comment. With the comment and the implementation being inline and trivial, I don't think anyone would have trouble understanding what this method does.

79

You mean to have distinct names for two different overloads?
I wouldn't do it, they both have fairly similar outputs, could add a small comment that the one with SourceManager should probably be preferred if you have one.

96

SG.
WDYT of exposing the struct members via getters too?

This would mean uniform access for all things (beginOffset, endOffset, length) and adding a ctor that checks invariants (Begin <= End, File.isValid()) would also fit naturally.
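
A minimal sketch of what that could look like, assuming a stand-in FileID so the snippet is self-contained (the real one is clang's):

```cpp
#include <cassert>

// Stand-in for clang's FileID, just enough for the sketch.
struct FileID {
  int ID;
  bool isValid() const { return ID >= 0; }
};

// A half-open character range [Begin, End) in a single file, with the
// invariants checked once in the constructor and uniform getters.
class FileRange {
  FileID File;
  unsigned Begin, End;

public:
  FileRange(FileID File, unsigned Begin, unsigned End)
      : File(File), Begin(Begin), End(End) {
    assert(File.isValid() && "FileRange needs a valid file");
    assert(Begin <= End && "degenerate FileRange");
  }
  FileID file() const { return File; }
  unsigned beginOffset() const { return Begin; }
  unsigned endOffset() const { return End; }
  unsigned length() const { return End - Begin; }
};
```

Checking the invariants in one place means every accessor can rely on them, and callers get the same spelling for all three derived quantities.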

112

Spelled looks fine. The only downside is that it intersects with SourceManager's definition of spelled (which is very similar, but not the same).
Still like it more than raw.

161

Yes, Expanded should be a subrange of expandedTokens(). Added a comment.

166

I've changed this to accept a spelled token sequence, but kept it a member function.
This allows adding an assertion that the raw tokens come from the corresponding raw token stream.

clang/lib/Tooling/Syntax/Tokens.cpp
128

Done. I've added a FIXME; it's not very detailed, though (does not have an example). Let me know if you feel it's necessary to add one.

ilya-biryukov added inline comments.Apr 24 2019, 8:38 AM
clang/lib/Tooling/Syntax/Tokens.cpp
135

We now have the corresponding assertions in a helper function.

  • s/llvm::unittest::ValueIs/llvm::ValueIs.
  • Add tests for empty macro expansions and directives around macro expansions.
  • Record all gaps between spelled and expanded tokens.
  • Tweak test dump to distinguish different tokens with the same spelling.

@sammccall, could we do another round of review? I think it's perfect now...
(Just kidding :-) )

  • Update a comment
sammccall accepted this revision.Apr 29 2019, 12:54 AM
sammccall marked an inline comment as done.

Rest is details only.

clang/include/clang/Tooling/Syntax/Tokens.h
60

No. I think I got confused by the explicit clang::Token constructor.

63

(this is ok. I do think a FileRange accessor would make client code more readable/less error-prone. Let's discuss offline)

76

there is no "other overload"

79

No sorry, I meant do we need both str() and operator<<

96

that sounds fine to me, though I don't feel strongly

clang/lib/Tooling/Syntax/Tokens.cpp
78

Nit: I'd suggest moving this all the way to the bottom or something? It's pretty big and seems a bit "in the way" here.

95

why not just take a ref?

101

you might want to move the token -> fileID calculation to a helper function (or method on token) called by findRawByExpanded, and then put this function onto MarkedFile.

Reason is, computing the common #include ancestor (and therefore the file buffer to use) can't live in this function, which only sees one of the endpoints.

But this can be deferred until that case is handled...

109

(renaming ExpandedIndex -> L seems confusing, use same name or just capture?)

131

As predicted :-) I think these _<index> suffixes are a maintenance hazard.

In practice, the assertions are likely to add them by copy-pasting the test output.

They guard against a class of error that doesn't seem very likely, and in practice they don't even really prevent it (because of the copy/paste issue).

I'd suggest just dropping them, and living with the test assertions being slightly ambiguous.

Failing that, some slightly trickier formatting could give more context:

A B C D E F --> A B ... E F

With special cases for smaller numbers of tokens. I don't like the irregularity of that approach though.

157

if you prefer, this is

auto It = llvm::bsearch(File.Mappings,
                        [&](const Mapping &M) { return M.BeginExpanded >= Expanded; });

I *think* this is clear enough that you can drop the comment, though I may be biased.

172

nit: "include root" or "include ancestor"?

clang/unittests/Tooling/Syntax/TokensTest.cpp
623

please fix :-)

This revision is now accepted and ready to land.Apr 29 2019, 12:54 AM
  • Simplify collection of tokens
  • Move dumping code to the bottom of the file
ilya-biryukov marked 2 inline comments as done.Apr 30 2019, 2:24 AM
  • Get only expanded stream from the preprocessor, recompute mappings and spelled stream separately
sammccall added inline comments.May 7 2019, 7:45 AM
clang/lib/Tooling/Syntax/Tokens.cpp
131

(this one is still open)

217

what's "$[collect-tokens]"?
Maybe just "Token: "?

225

this function is now >100 lines long. Can we split it up?

226

this could be a member, which would help splitting up

227

Is there a reason we do this at the end instead of as tokens are collected?

244

consumeSpelledUntil and fillGapUntil could be methods, I think

278

the body here could be a method too, I think
i.e. for each (expanded token), process it?

323

and similarly this section

clang/unittests/Tooling/Syntax/TokensTest.cpp
623

(still missing?)

ilya-biryukov added inline comments.May 7 2019, 8:36 AM
clang/lib/Tooling/Syntax/Tokens.cpp
131

Will make sure to do something about it before submitting

clang/unittests/Tooling/Syntax/TokensTest.cpp
623

Will make sure to land this before submitting.

  • Move the construction logic to a separate class, split into multiple methods.
ilya-biryukov marked 6 inline comments as done.
  • Move filling the gaps at the end of files to a separate function
ilya-biryukov added inline comments.May 9 2019, 9:52 AM
clang/lib/Tooling/Syntax/Tokens.cpp
131

As mentioned in the offline conversations, we both agree that we don't want to have ambiguous test-cases.
The proposed alternative was putting the counters of tokens of the same kind in the same token stream, with the goal of making updating test-cases simpler (now one would need to update only the indices of the tokens they changed).

After playing around with the idea, I think this complicates the dumping function too much. The easiest way to update the test is to run it and copy-paste the expected output.
So I'd keep it as is and avoid the extra complexity, if that's ok.

227

No strong reason, just wanted to keep the preprocessor callback minimal

ilya-biryukov marked 2 inline comments as done.May 9 2019, 9:53 AM
ilya-biryukov marked 5 inline comments as done.
  • Check invariants on FileRange construction, unify access to all fields
ilya-biryukov marked 3 inline comments as done.May 9 2019, 10:15 AM
ilya-biryukov added inline comments.
clang/include/clang/Tooling/Syntax/Tokens.h
79

one can type str() in a debugger; operator<< is for convenience when one is already using streams.

ilya-biryukov marked 2 inline comments as done.May 9 2019, 10:17 AM
ilya-biryukov added inline comments.
clang/include/clang/Tooling/Syntax/Tokens.h
63

I've added a corresponding accessor (and a version of it that accepts a range of tokens) to D61681.
I'd keep it off this patch for now, will refactor in a follow-up.

ilya-biryukov marked 2 inline comments as done.May 9 2019, 10:18 AM
ilya-biryukov marked 2 inline comments as done.
  • Use bsearch instead of upper_bound
  • Remove the function that maps tokens to offsets
  • Skip annotation tokens; some of them are reported by the preprocessor, but we don't want them in the final expanded stream.
  • Add functions to compute FileRange of tokens, add tests for it.
sammccall accepted this revision.May 17 2019, 7:34 AM
sammccall added inline comments.
clang/include/clang/Tooling/Syntax/Tokens.h
50

nit: character range (just to be totally explicit)?

116

I think this might need a more explicit name. It's reasonably obvious that this needs to be optional for some cases (token pasting), but it's not obvious at callsites that (or why) we don't use the spelling location for macro-expanded tokens.

One option would be just do that and add an expandedFromMacro() helper so the no-macros version is easy to express too.
If we can't do that, maybe directlySpelledRange or something?

122

similar to above, I'd naively expect this to return a valid range, given the tokens expanded from assert(X && [[Y.z()]] )

ilya-biryukov added inline comments.May 17 2019, 9:38 AM
clang/include/clang/Tooling/Syntax/Tokens.h
116

As mentioned in the offline conversation, the idea is that mapping from a token inside a macro expansion to a spelled token should be handled by TokenBuffer, and these two functions are really just convenience helpers to get to a range *after* the mapping.

This change has been boiling for some time and I think the other bits of it seem to be non-controversial and agreed upon.
Would it be ok if we land it with this function using a more concrete name (directlySpelledRange or something else) or move it into a separate change?

There's a follow-up change that adds an 'expand macro' action to clangd, which both has a use-site for these method and provides a functional feature based on TokenBuffer. Iterating on the design with (even a single) use-case should be simpler.

sammccall added inline comments.May 17 2019, 9:59 AM
clang/include/clang/Tooling/Syntax/Tokens.h
116

If we do want to reflect expanded/spelled as types, this will rapidly get harder to change. But we do need to make progress here.

If passing spelled tokens is the intended/well-understood use, let's just assert on that and not return Optional. Then I'm less worried about the name: misuse will be reliably punished.

ilya-biryukov marked 3 inline comments as done.
  • Fix a comment, reformat
  • Use assertions instead of returning optionals
clang/include/clang/Tooling/Syntax/Tokens.h
116

Added an assertion. Not sure about the name, kept range for now for the lack of a better alternative.

This revision was automatically updated to reflect the committed changes.
thakis added a subscriber: thakis.May 20 2019, 12:17 PM

Out of interest: The RecursiveASTVisitorTests are part of the ToolingTests binary while this adds a new binary TokensTest. Can you say why?

(No change needed, just curious.)

Another comment: The new binary is called TokensTest but is in a directory "Syntax". For consistency with all other unit test binaries, please either rename the binary to SyntaxTests, or rename the directory to "Tokens". (From the patch description, the former seems more appropriate.) Note the missing trailing "s" in the binary name too.

Out of interest: The RecursiveASTVisitorTests are part of the ToolingTests binary while this adds a new binary TokensTest. Can you say why?

(No change needed, just curious.)

The syntax library is mostly independent of the rest of tooling, so I'd rather keep everything related to it separate, including the tests.
I don't think we'll get anything in terms of code reuse from merging them into the same test binary either.

Another comment: The new binary is called TokensTest but is in a directory "Syntax". For consistency with all other unit test binaries, please either rename the binary to SyntaxTests, or rename the directory to "Tokens". (From the patch description, the former seems more appropriate.) Note the missing trailing "s" in the binary name too.

Agree with renaming the binary. In fact, the not-yet-landed revisions also use SyntaxTests, and I should've named it this way from the start. I'll land a patch.

nridge added a subscriber: nridge.Jan 9 2020, 3:24 PM
nridge added inline comments.
include/clang/Tooling/Syntax/Tokens.h
206 ↗(On Diff #200260)

Is this comment correct?

If so:

  • Why are the tokens "int", "name", "=", "10" not included?
  • Why are the tokens "DECL", "(", "a", ")" not included?
ilya-biryukov marked 2 inline comments as done.Jan 9 2020, 11:07 PM
ilya-biryukov added inline comments.
include/clang/Tooling/Syntax/Tokens.h
206 ↗(On Diff #200260)

It isn't. Thanks for catching that.
This patch went through so many revisions, it was close to impossible to track this down.

ilya-biryukov marked 2 inline comments as done.Jan 9 2020, 11:17 PM
ilya-biryukov added inline comments.
include/clang/Tooling/Syntax/Tokens.h
206 ↗(On Diff #200260)

Also:

  • we do not store 'eof' in the spelled tokens anymore
  • the FIXME is stale, we do store tokens of macro directives now

759c90456d418ffe69e1a2b4bcea2792491a6b5a updates the comment.