Page MenuHomePhabricator

[Syntax] Introduce syntax trees

Authored by ilya-biryukov on May 7 2019, 3:41 AM.



A tooling-focused alternative to the AST. This commit focuses on the
memory-management strategy and the structure of the AST.

More to follow later:

  • Operations to mutate the syntax trees and corresponding textual replacements.
  • Mapping between clang AST nodes and syntax tree nodes.
  • More node types corresponding to the language constructs.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
sammccall added inline comments.May 21 2019, 7:24 AM

if you make this operator<<, then it's slightly more flexible I think (llvm::to_string also works).
It's not as fast, but I don't think that matters here?


I haven't reviewed this file yet :-)

ilya-biryukov added inline comments.May 21 2019, 7:46 AM
38 ↗(On Diff #198427)

The only use-case in my prototype so far is creating token nodes for punctuation nodes, e.g. say you want to create an expr of the form <a>+<b>, where both a and b are existing expressions and you need to synthesize a leaf node for +.
We use this function to synthesize a buffer with the corresponding token.

All the use-cases I can imagine are around synthesizing syntax trees (as opposed to constructing them from the AST).

ilya-biryukov marked 24 inline comments as done.
  • Address comments
Herald added a project: Restricted Project. · View Herald TranscriptJun 3 2019, 9:57 AM

I've addressed most of the comments, except the naming ones.
We need a convention for naming the language nodes and names for composite and leaf structural nodes.

For "language" nodes, I suggest we use CompoundStatement, Recovery, TopLevelDeclaration, TemplateArgumentList, TypeTemplateArgument, etc. That is, we spell out the words in full, no shorthands like Stmt or Expr. That would make things a bit more verbose, but hopefully that helps distinguish from clang AST.

For structural nodes, see the relevant comment threads.


Now part of Tree.h

38 ↗(On Diff #198427)

Renamed to lexBuffer. This does not have usages (yet), we can also remove it from this patch if needed.

40 ↗(On Diff #198427)

Removed in favour of placement new. I guess that also makes it a bit more natural to have separate storage for other nodes that have different lifetime (e.g. use a separate arena).


Changed to the proposed header structure, it looks good.
Had to move Leaf::classof and TreeNode::classof into a .cpp, they need a definition of NodeKind. Keeping those in a header file was a micro-optimization anyway, that's probably not too important.


syntax::Tokens actually looks good, but we should rename the tokens() accessor somehow in that case.
I have only super-generic variants in my head: elements(), items(). Any better ideas?


I'd avoid using macros. As long as the amount of boilerplate is small, it's not too annoying to write.

And it is small for now, we only have a single constructor and a classof per node, so keeping it explicit in this patch seems ok.

We can revisit if more stuff pops up, but I think we do it without extra boilerplate per node for clone, and hopefully for other things too.


Subtree seems ok, although Composite conveys the meaning better to my taste.

Composite does not seem to work without the node suffix, though, and we probably don't want the suffix in other nodes, so I'm torn on this.


Not anymore. The original reason was that Tree.h need NodeKind for implementing casts and the implementation was in a header.


For some reason I thought gdb does not show the enum value names, so I added a named method to simplify debugging.

Turns out it does show the enum names, not just int values, so I'm perfectly happy to the stream output operator.


No more custom containers, explicit tree structure seems to be better.


Please do!

ilya-biryukov added inline comments.Jun 3 2019, 10:02 AM
11 ↗(On Diff #202745)

This needs an update, will do in the next round!

I've addressed most of the comments, except the naming ones.
We need a convention for naming the language nodes and names for composite and leaf structural nodes.

For "language" nodes, I suggest we use CompoundStatement, Recovery, TopLevelDeclaration, TemplateArgumentList, TypeTemplateArgument, etc. That is, we spell out the words in full, no shorthands like Stmt or Expr. That would make things a bit more verbose, but hopefully that helps distinguish from clang AST.



as discussed offline, having a leaf node store a range of tokens (rather than just one) looks attractive now, but as the syntax tree gets more detailed there are relatively few cases where multiple consecutive tokens should really be undifferentiated siblings.

Might be better to bite the bullet now and make leaf hold a single token, so our consecutive token ranges become a linked list. This will also flush out accidental assumptions that only tokens in the same Leaf are adjacent.

Given this, I'm not sure there's a better name than syntax::Leaf. You might consider making it a pointer-like object dereferenceable to Token&.


As discussed, I think syntax::Tree might actually be a better name here.

"A Node is either a Tree or a Leaf" is a bit weird, but not too hard to remember I think.


(curious: why prepend rather than append?)

26 ↗(On Diff #202745)

I find "currently processed" a bit vague. "Pending"?

37 ↗(On Diff #202745)

maybe formNode, formToken, formRoot()?

37 ↗(On Diff #202745)

if syntax nodes strictly nest and we form left-to-right and bottom up, then why are there ever pending nodes that aren't in the range? Is it because we don't aggregate them as early as possible?

(edit: after offline discussion, there are precise invariants here that could be documented and asserted)

41 ↗(On Diff #202745)

Particularly in view of having tokens be 1:1 with Leaf, *constructing* the token nodes as part of higher level constructs / as part of recovery seems a little odd.

What if we constructed all the leaf nodes up front, forming a linked list:
int -> a -> = -> 2 -> + -> 2 -> ; -> eof

When you form a node that covers a range, you splice out the nodes in that range, replacing with the new node:

int -> a -> = -> (2 + 2) -> ; -> eof
(int a = (2 + 2)) -> ; -> eof

Then the invariant is you have a forest, the roots form a linked list, and the trees' respective leaves are a left-to-right partition of the input tokens.

I think this would mean:

  • no separate vector<RangedNode> data structure (AFAICS we can reuse Node)
  • don't have the requirement that the formed node must claim a suffix of the pending nodes, which simplifies the recursive AST visitior

We lose the binary search, but I think tracking the last accessed root (current position) and linear searching left/right will be at least as good in practice, because tree traversal is fundamentally pretty local.

48 ↗(On Diff #202745)

any particular reason learnRoot() and root() are different functions?

If learnRoot() returned TranslationUnit*, then we avoid the need for the caller to know about the dependency, it would track the state itself.

55 ↗(On Diff #202745)

or NodeForRange

92 ↗(On Diff #202745)

So explicitly claiming the primitive tokens, but implicitly claiming the subtrees, seems like a weird mix.
Having both explicit might be nicer:

  • It seems somewhat likely we want to record/tag their semantics (maybe in the child itself, or even the low bits of its pointer?), rather than having accessors scan around looking for something likely.
  • currently when expected subtrees fail to parse, their tokens get (implicitly) wrapped up in Recovery nodes. They're good targets for heuristic parsing, but this probably means we should record what an unexplained range of tokens is supposed to be for.

Thinking of something like:

builder.expectToken(l_brace, S->getLBracLoc());
builder.expectTree(Statement, S->getBody());
builder.expectToken(r_brace, S->getRBracLoc());

where the builder would react to non-null AST nodes by mapping the associated syntax node, and null AST nodes by trying to heuristically parse the tokens in between LBracLoc and RBracLoc.

But lots of unanswered questions here: body is a list of statements, how does that work? What if LBracLoc or RBracLoc is missing? etc.

93 ↗(On Diff #202745)

btw what if LBracLoc or RBracLoc are invalid here due to parser recovery?

135 ↗(On Diff #202745)

this function needs some high-level implementation comments

164 ↗(On Diff #202745)

why can It not point to a node that spans/is past End?

(edit after offline discussion: it's an important invariant that we're always consuming a suffix of the pending nodes)

226 ↗(On Diff #202745)

or It = bsearch(Tokens, [&](const Syntax::Token& L) { return !SM.isBeforeInTranslationUnit(L.location(), TokLoc); })

ilya-biryukov marked 7 inline comments as done.
  • A leaf node stores a single token
  • Restructure code to avoid special-casing leaf nodes

This is not 100% ready yet, but wanted to send it out anyway, as I'll be on vacation until Tuesday.

I've addressed most major comments. In particular, TreeBuilder now looks simpler (and more structured) to my taste.
One thing that's missing is adding children in arbitrary order. It won't be too complicated (would require some thought on how to properly create recovery nodes, though). I'd be tempted to land the current implementation as is and allow adding children in arbitrary order in a separate change (alongside more types of nodes), but let me know what you think.


Leaf now stores a single token, that actually simplifies things quite a bit, thanks!
I'd avoid making it a pointer-like object, given that nodes are often passed as pointers on their own. Making them a pointer-like object would mean we can get code that does double deferences (Tok = **Leaf).


Appending to a linked list is O(n). If we reverse it, traversing left-to-right order is O(n).


Append is O(n) in the current representation as it requires walking to the tail of the list.

41 ↗(On Diff #202745)

I went with a slightly different approach, similar to how parser does it. Please let me know what you think.
The new Forest helper struct ensures the tree structure invariants (all tokens must be covered, nodes must nest properly based on a syntax structure), the rest of the code in tree builder takes care of folding the nodes in a proper order and properly advancing the token stream (it's somewhat similar in the details of how parsers are implemented, except that instead of parsing we actually walk a pre-parsed AST).

It still needs some comments, but I think its intentions should be clear.
Let me know what you think, happy to discuss this offline too.

48 ↗(On Diff #202745)

learnRoot is called inside ast visitor when processing TranslationUnitDecl and root() is used to consume the result.

I guess we could just delay learnRoot until consume() is called, shouldn't be a big deal. I'll do this in the next iteration.

93 ↗(On Diff #202745)

This will currently break and we should definitely fix this

93 ↗(On Diff #202745)

We don't recover from errors properly here, I'd add a FIXME and figure this out later. Does that SG?

The general strategy I would propose is to just skip the tokens (we will return null from the corresponding accessors, etc) and create recovery nodes (RecoveryExpression, etc.) for composite nodes.

  • Do the renames
  • A few more renames and docs
  • Cleanups and comments
  • Reformat the code
ilya-biryukov marked 13 inline comments as done.Jun 21 2019, 5:07 AM

This is ready for another round now.

37 ↗(On Diff #202745)

Sorry, missed this comment and went with expectToken() and expectNode(), root is now built on consume(). Can change to form*, not a big deal.

I still need to write good docs about the invariants here, so leaving this open.

48 ↗(On Diff #202745)

consume() now builds the root node.

92 ↗(On Diff #202745)

That's exactly the design we have one, with a limitation that expect* have to be called in left-to-right and bottom-up manner. Also, you can only build a tree that ends at the tokens that were consumed.

This is actually a reasonable interface to build a tree from an actual parser, but might feel weird for a ast-to-syntax transformation. I need to figure out a way to write good docs about it, but there's a separate comment for that. Marking this as done (although the questions you mentioned at the end are still there)

135 ↗(On Diff #202745)

Done. Also renamed it to consumeNode.

The docs are very short, though, might need a revamp for clarity.

164 ↗(On Diff #202745)

This is now spelled out in the documentation for foldChildren.

ilya-biryukov marked 4 inline comments as done.Jun 21 2019, 5:08 AM

Although there are still rough edges, I believe the storage model is agreed upon and we can hopefully address the rest in the follow-ups.

  • Introduce roles to allow distinguishing the child nodes.
  • Remove recovery node, use an unknown role instead.
  • TreeBuidler now can consume children at any point, not just suffix nodes.
  • Remove (outdated) changes to gn files

This is now in a pretty good shape, I've incorporated changes after our offline discussions about child roles.
The builder interface is also much richer now, removing a requirement that the tree has to be traversed left-to-right (bottom-up is still required!).

sammccall accepted this revision.Jul 1 2019, 7:21 AM

Nice, let's land this!

35 ↗(On Diff #206715)

I don't think TU is actually a declaration. Is there a reason to consider it one from a syntax POV?

41 ↗(On Diff #206715)

I have a slight feeling the EOF token is going to be annoying, e.g. can't just splice stuff in at the end of the list. But not sure if it'll be a big deal, and whether the alternatives are better.

43 ↗(On Diff #206715)

we discussed offline - with 256 values, is it possible to come up with a single role enum that would cover all node types?

Advantage would be that certain logic could be generic (e.g. Recovery could be a role for leaves under any Tree, LParen/RParen/MainKeyword could apply to if, while, switch...)

This revision is now accepted and ready to land.Jul 1 2019, 7:21 AM
ilya-biryukov marked 4 inline comments as done.
  • s/TranslationUnitDeclaration/TranslationUnit
  • Remove accessor from 'eof', add a FIXME to remove it from the tree altogether
ilya-biryukov added inline comments.Jul 8 2019, 10:24 AM
35 ↗(On Diff #206715)

It's a declaration in a sense that it has a corresponding instance of clang::Decl that it "introduces", i.e. the clang::TranslationUnitDecl.

But you are right, TranslationUnit is a better name: this aligns with the C++ grammar from the standard (translation-unit), it does lack similarity with other declarations from the standard (and from clang).

Renamed to TranslationUnit

41 ↗(On Diff #206715)

That's a good point, I actually don't see how eof token in a tree would be useful (it's probably fine to have in the TokenBuffer, though, allows). Moreover, it could cause confusion and bugs when working with multiple translation units (I imagine moving nodes between two different TUs and ending up with multiple eofs somewhere)

I've removed the corresponding accessor from the TranslationUnit node and added a FIXME to remove it from the tree altogether.

43 ↗(On Diff #206715)

Will need to do some estimations to answer this properly, but my gut feeling is that 256 could end up being too limiting in the long run (I would expect each node to have at least one child, so without deduplication we can at least as many roles as we have kinds).

Could imagine a two-level numbering scheme, though:

  • some generic roles like lparen, rparen, etc, take first N roles.
  • higher numbers are for node-specific roles (e.g. LHS or RHS of a BinaryExpr).

But at that point, we probably don't have the benefits of a single enum.

This revision was automatically updated to reflect the committed changes.

@ilya-biryukov We're seeing buildbot failures in SyntaxTests.exe :

Failing Tests (1):

Clang-Unit :: Tooling/Syntax/./SyntaxTests.exe/SyntaxTreeTest.Basic

Sorry about that. That's the same error we had with the previous patch. Will send a fix right away.

RKSimon reopened this revision.Jul 9 2019, 4:30 AM

@ilya-biryukov I'm sorry but I've reverted this at rL365465

This revision is now accepted and ready to land.Jul 9 2019, 4:30 AM
RKSimon requested changes to this revision.Jul 9 2019, 4:30 AM
This revision now requires changes to proceed.Jul 9 2019, 4:30 AM

Relanded in rL365466 with a fix to the crash.

ilya-biryukov abandoned this revision.Jul 9 2019, 4:35 AM
sammccall added inline comments.Jul 9 2019, 6:43 AM
43 ↗(On Diff #206715)

I think we misunderstood each other here... I think this is fairly important, and that we'd agreed on it in offline discussion. Didn't mean to leave it as an optional comment.

I'd be very surprised if 256 were too limiting. Indeed most nodes will have children, but most of them will not have unique roles. (And I would be surprised if we have 200 node types, but maybe not that surprised...).

If there's a more fundamental objection to merging these, I'd like to find some agreement before going further.