Page MenuHomePhabricator

[Syntax] Introduce TokenBuffer, start clangToolingSyntax library
ClosedPublic

Authored by ilya-biryukov on Mar 27 2019, 9:21 AM.

Details

Summary

TokenBuffer stores the list of tokens for a file obtained after
preprocessing. This is a base building block for syntax trees,
see [1] for the full proposal on syntax trees.

This commits also starts a new sub-library of ClangTooling, which
would be the home for the syntax trees and syntax-tree-based refactoring
utilities.

[1]: https://lists.llvm.org/pipermail/cfe-dev/2019-February/061414.html

Diff Detail

Repository
rC Clang

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
ilya-biryukov added inline comments.Apr 24 2019, 8:38 AM
clang/include/clang/Tooling/Syntax/Tokens.h
62 ↗(On Diff #194683)

Added a comment. With the comment and an implementation being inline and trivial, I don't think anyone would have trouble understanding what this method does.

78 ↗(On Diff #194683)

You mean to have distinct names for two different overloads?
I wouldn't do it, they both have fairly similar outputs, could add a small comment that the one with SourceManager should probably be preferred if you have one.

95 ↗(On Diff #194683)

SG.
WDYT of exposing the struct members via getters too?

This would mean uniform access for all things (beginOffset, endOffset, length) and adding a ctor that checks invariants (Begin <= End, File.isValid()) would also fit naturally.

111 ↗(On Diff #194683)

Spelled looks fine. The only downside is that it intersects with SourceManager's definition of spelled (which is very similar, but not the same).
Still like it more than raw.

160 ↗(On Diff #194683)

Yes, Expanded should be a subrange of expandedTokens(). Added a comment.

165 ↗(On Diff #194683)

I've changed this to accept a spelled token sequence, but kept it a member function.
This allows to add an assertion that raw tokens come from a corresponding raw token stream.

clang/lib/Tooling/Syntax/Tokens.cpp
127 ↗(On Diff #194683)

Done. I've added a fixme, it's not very detailed though (does not have an example), let me know if you feel it's necessary to add one.

134 ↗(On Diff #194683)

We now have the corresponding assertions in a helper function.

  • s/llvm::unittest::ValueIs/llvm::ValueIs.
  • Add tests for empty macro expansions and directives around macro expansions.
  • Record all gaps between spelled and expanded tokens.
  • Tweak test dump to distinguish different tokens with the same spelling.

@sammccall, could we do another round of review? I think it's perfect now...
(Just kidding :-) )

  • Update a comment
sammccall accepted this revision.Apr 29 2019, 12:54 AM
sammccall marked an inline comment as done.

Rest is details only.

clang/include/clang/Tooling/Syntax/Tokens.h
59 ↗(On Diff #194683)

No. I think I got confused by the explicit clang::Token constructor.

62 ↗(On Diff #194683)

(this is ok. I do think a FileRange accessor would make client code more readable/less error-prone. Let's discuss offline)

78 ↗(On Diff #194683)

No sorry, I meant do we need both str() and operator<<

95 ↗(On Diff #194683)

that sounds fine to me, though I don't feel strongly

75 ↗(On Diff #196849)

there is no "other overload"

clang/lib/Tooling/Syntax/Tokens.cpp
94 ↗(On Diff #195425)

why not just take a ref?

100 ↗(On Diff #195425)

you might want to move the token -> fileID calculation to a helper function (or method on token) called by findRawByExpanded, and then put this function onto MarkedFile.

Reason is, computing the common #include ancestor (and therefore the file buffer to use) can't live in this function, which only sees one of the endpoints.

But this can be deferred until that case is handled...

108 ↗(On Diff #195425)

(renaming ExpandedIndex -> L seems confusing, use same name or just capture?)

171 ↗(On Diff #195425)

nit: "include root" or "include ancestor"?

77 ↗(On Diff #196849)

Nit: I'd suggest moving this all the way to the bottom or something? It's pretty big and seems a bit "in the way" here.

130 ↗(On Diff #196849)

As predicted :-) I think these _<index> suffixes are a maintenance hazard.

In practice, the assertions are likely to add them by copy-pasting the test output.

They guard against a class of error that doesn't seem very likely, and in practice they don't even really prevent it (because of the copy/paste issue).

I'd suggest just dropping them, and living with the test assertions being slightly ambiguous.

Failing that, some slightly trickier formatting could give more context:

A B C D E F --> A B ... E F

With special cases for smaller numbers of tokens. I don't like the irregularity of that approach though.

156 ↗(On Diff #196849)

if you prefer, this is

auto It = llvm::bsearch(File.Mappings,
                        [&](const Mapping &M) { return M.BeginExpanded >= Expanded; });

I *think* this is clear enough that you can drop the comment, though I may be biased.

clang/unittests/Tooling/Syntax/TokensTest.cpp
622 ↗(On Diff #196849)

please fix :-)

This revision is now accepted and ready to land.Apr 29 2019, 12:54 AM
  • Simplify collection of tokens
  • Move dumping code to the bottom of the file
ilya-biryukov marked 2 inline comments as done.Apr 30 2019, 2:24 AM
  • Get only expanded stream from the preprocessor, recompute mappings and spelled stream separately
sammccall added inline comments.May 7 2019, 7:45 AM
clang/lib/Tooling/Syntax/Tokens.cpp
130 ↗(On Diff #196849)

(this one is still open)

216 ↗(On Diff #197803)

what's "$[collect-tokens]"?
Maybe just "Token: "?

224 ↗(On Diff #197803)

this function is now >100 lines long. Can we split it up?

225 ↗(On Diff #197803)

this could be a member, which would help splitting up

226 ↗(On Diff #197803)

Is there a reason we do this at the end instead of as tokens are collected?

243 ↗(On Diff #197803)

consumespelleduntil and fillgapuntil could be methods, I think

277 ↗(On Diff #197803)

the body here could be a method too, I think
i.e. for each (expanded token), process it?

322 ↗(On Diff #197803)

and similarly this section

clang/unittests/Tooling/Syntax/TokensTest.cpp
622 ↗(On Diff #196849)

(still missing?)

ilya-biryukov added inline comments.May 7 2019, 8:36 AM
clang/lib/Tooling/Syntax/Tokens.cpp
130 ↗(On Diff #196849)

Will make sure to do something about it before submitting

clang/unittests/Tooling/Syntax/TokensTest.cpp
622 ↗(On Diff #196849)

Will make sure to land this before submitting.

  • Move the constuction logic to a separate class, split into multiple methods.
ilya-biryukov marked 6 inline comments as done.
  • Move filling the gaps at the end of files to a separate function
ilya-biryukov added inline comments.May 9 2019, 9:52 AM
clang/lib/Tooling/Syntax/Tokens.cpp
130 ↗(On Diff #196849)

As mentioned in the offline conversations, we both agree that we don't want to have ambiguous test-cases.
The proposed alternative was putting the counters of tokens of the same kind in the same token stream, with a goal of making updating test-cases simpler (now one would need to update only indicies of the tokens they changed).

After playing around with the idea, I think this complicates the dumping function too much. The easiest way to update the test is to run it and copy-paste the expected output.
So I'd keep as is and avoiding the extra complexity if that's ok

226 ↗(On Diff #197803)

No strong reason, just wanted to keep the preprocessor callback minimal

ilya-biryukov marked 2 inline comments as done.May 9 2019, 9:53 AM
ilya-biryukov marked 5 inline comments as done.
  • Check invariants on FileRange construction, unify access to all fields
ilya-biryukov marked 3 inline comments as done.May 9 2019, 10:15 AM
ilya-biryukov added inline comments.
clang/include/clang/Tooling/Syntax/Tokens.h
78 ↗(On Diff #194683)

one can type str() in debugger, operator << is for convenience when one is already using streams.

ilya-biryukov marked 2 inline comments as done.May 9 2019, 10:17 AM
ilya-biryukov added inline comments.
clang/include/clang/Tooling/Syntax/Tokens.h
62 ↗(On Diff #194683)

I've added a corresponding accessor (and a version of it that accepts a range of tokens) to D61681.
I'd keep it off this patch for now, will refactor in a follow-up.

ilya-biryukov marked 2 inline comments as done.May 9 2019, 10:18 AM
ilya-biryukov marked 2 inline comments as done.
  • Use bsearch instead of upper_bound
  • Remove the function that maps tokens to offsets
  • Skip annotation tokens, some of them are reported by the preprocessor but we don't want them in the final expanded stream.
  • Add functions to compute FileRange of tokens, add tests for it.
sammccall accepted this revision.May 17 2019, 7:34 AM
sammccall added inline comments.
clang/include/clang/Tooling/Syntax/Tokens.h
49 ↗(On Diff #200030)

nit: character range (just to be totally explicit)?

115 ↗(On Diff #200030)

I think this might need a more explicit name. It's reasonably obvious that this needs to be optional for some cases (token pasting), but it's not obvious at callsites that (or why) we don't use the spelling location for macro-expanded tokens.

One option would be just do that and add an expandedFromMacro() helper so the no-macros version is easy to express too.
If we can't do that, maybe directlySpelledRange or something?

121 ↗(On Diff #200030)

similar to above, I'd naively expect this to return a valid range, given the tokens expanded from assert(X && [[Y.z()]] )

ilya-biryukov added inline comments.May 17 2019, 9:38 AM
clang/include/clang/Tooling/Syntax/Tokens.h
115 ↗(On Diff #200030)

As mentioned in the offline conversation, the idea is that mapping from a token inside a macro expansion to a spelled token should be handled by TokenBuffer and these two functions is really just a convenience helper to get to a range *after* the mapping.

This change has been boiling for some time and I think the other bits of it seem to be non-controversial and agreed upon.
Would it be ok if we land it with this function using a more concrete name (directlySpelledRange or something else) or move it into a separate change?

There's a follow-up change that adds an 'expand macro' action to clangd, which both has a use-site for these method and provides a functional feature based on TokenBuffer. Iterating on the design with (even a single) use-case should be simpler.

sammccall added inline comments.May 17 2019, 9:59 AM
clang/include/clang/Tooling/Syntax/Tokens.h
115 ↗(On Diff #200030)

If we do want to reflect expanded/spelled as types, this will rapidly get harder to change. But we do need to make progress here.

If passing spelled tokens is the intended/well-understood use, let's just assert on that and not return Optional. Then I'm less worried about the name: misuse will be reliably punished.

ilya-biryukov marked 3 inline comments as done.
  • Fix a comment, reformat
  • Use assertions instead of returning optionals
clang/include/clang/Tooling/Syntax/Tokens.h
115 ↗(On Diff #200030)

Added an assertion. Not sure about the name, kept range for now for the lack of a better alternative.

This revision was automatically updated to reflect the committed changes.

Out of interest: The RecursiveASTVisitorTests are part of the ToolingTests binary while this adds a new binary TokensTest. Can you say why?

(No change needed, just curious.)

Another comment: The new binary is called TokensTest but is in a directory "Syntax". For consistency with all other unit test binaries, please either rename the binary to SyntaxTests, or rename the directory to "Tokens". (From the patch description, the former seems more appropriate.) Note the missing trailing "s" in the binary name too.

Out of interest: The RecursiveASTVisitorTests are part of the ToolingTests binary while this adds a new binary TokensTest. Can you say why?

(No change needed, just curious.)

The syntax library is mostly independent from the rest of tooling, so I'd rather keep everything related to it separate including the tests.
I don't think we'll get anything in terms of code reuse from merging them into the same test binary either.

Another comment: The new binary is called TokensTest but is in a directory "Syntax". For consistency with all other unit test binaries, please either rename the binary to SyntaxTests, or rename the directory to "Tokens". (From the patch description, the former seems more appropriate.) Note the missing trailing "s" in the binary name too.

Agree with renaming the binary. In fact, the not-yet-landed revisions also use SyntaxTests, and I should've named it this way from the start. I'll land a patch.