This is an archive of the discontinued LLVM Phabricator instance.

[Syntax] Introduce syntax trees
AbandonedPublic

Authored by ilya-biryukov on May 7 2019, 3:41 AM.

Details

Summary

A tooling-focused alternative to the AST. This commit focuses on the
memory-management strategy and the structure of the AST.

More to follow later:

  • Operations to mutate the syntax trees and corresponding textual replacements.
  • Mapping between clang AST nodes and syntax tree nodes.
  • More node types corresponding to the language constructs.

Event Timeline

ilya-biryukov created this revision.May 7 2019, 3:41 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 7 2019, 3:41 AM
Herald added a subscriber: mgorny. · View Herald Transcript
sammccall added inline comments.May 7 2019, 8:40 AM
clang/include/clang/Tooling/Syntax/Corpus.h
23 ↗(On Diff #198427)

I think plain SyntaxArena might be a better name here :-/
Corpus refers to texts (the use in dex is by analogy, as we call symbols "documents" from search).

26 ↗(On Diff #198427)

MainFile is presumably the whole TU, name might need a tweak.
Can it be empty?
The relationship between Corpus and TokenBuffer seems a little weird. Why is it needed?

38 ↗(On Diff #198427)

Are you planning to have a way to add tokens directly? Having to turn them into text and re-lex them seems like it might be inconvenient.

40 ↗(On Diff #198427)

Now there's two ways to do this: new (C.allocator()) T(...) or C.construct<T>(...). Do we need both?

(If we do, is the syntax new (C) T(...) more natural?)

clang/include/clang/Tooling/Syntax/Tree.h
35

for a translation unit? or for only decls within the main file?

41

I've been burned with adding these APIs without use cases.

It seems likely you want a way to:

  • skip traversal of children
  • abort the traversal entirely
clang/include/clang/Tooling/Syntax/Tree/Cascade.h
2

this is Cascade.h, not tree.h

2

why "cascade"?

2

The Tree/ subdirectory seems superfluous - why are these separate from Syntax/?

75

This use of "tree node" to mean specifically internal node seems confusing - is it common?

ilya-biryukov added inline comments.May 7 2019, 10:20 AM
clang/include/clang/Tooling/Syntax/Tree.h
41

Not having an option to abort traversal protects us against timing attacks...

Agree with both, will address in this patch.

ilya-biryukov marked 5 inline comments as done.
  • Make traverse() internal to its only use-site.
  • s/Corpus/Arena.
  • Address some other comments.
clang/include/clang/Tooling/Syntax/Corpus.h
23 ↗(On Diff #198427)

Went with Arena

26 ↗(On Diff #198427)

Operations on the trees sometimes need to know anything about underlying tokens - they have access to TokenBuffer that produced them.

More specifically, this can be used to map between spelled and expanded tokens and check the mappings are possible.

38 ↗(On Diff #198427)

The tokens have source locations and refer to a text in some buffer. tokenizeBuffer makes is easier, not harder, to mock tokens.

40 ↗(On Diff #198427)

I think C.construct<T>() read better than new (C) T(...). Not a big fan of placement new exprs.

clang/include/clang/Tooling/Syntax/Tree.h
35

For a translation unit. We will add versions that built for a subtree of the AST later.

41

Removed from the public API, we seem to have different ideas on how it should look like and I'd prefer to focus on storage model in this patch.

clang/include/clang/Tooling/Syntax/Tree/Cascade.h
2

Cascade defines a few base nodes: a composite node (TreeNode) and a leaf node that holds tokens.
I'd really like to isolate them from language-specific nodes, so language-specific nodes live in a separate file (Nodes.h).
However, they need to see the definition of a composite node, hence the split.

Users are advised to use an umbrella header, Tree.h. The extra directory is to minimize the number of headers in the top-level directory, having too many is confusing.

75

I don't think it's common, can use CompositeNode - seems like a better alternative

ilya-biryukov edited the summary of this revision. (Show Details)May 8 2019, 2:33 AM
  • s/corpus/arena
  • Remove an accidental cmake change

Definitely like the choice of CompositeNode owning the concrete storage!

clang/include/clang/Tooling/Syntax/Arena.h
2

From a user's point of view, this looks a lot like part of the tree structure in some sense.

If you expect that users will need to keep the arena rather than the TokenBuffer around (e.g. so nodes can be allocated), then it might make sense to declare it at the bottom of Cascade.h

clang/include/clang/Tooling/Syntax/Corpus.h
38 ↗(On Diff #198427)

Fair enough.

lexBuffer might be a slightly clearer name?

Who are the intended non-test users of this function? Are they better served by being able (and responsible) for constructing a MemoryBuffer with e.g. a sensible name and ownership, or would it be better to pass a StringRef and have the Arena come up with a sensible "anonymous" name?

40 ↗(On Diff #198427)

They are fairly consistently used in llvm/clang for this sort of thing, though.

I find it valuable because arena allocations look otherwise a lot like buggy ownership patterns. The dedicated syntax calls out the unusual case: we're creating a new thing, and someone else owns it, but won't do anything with it.

clang/include/clang/Tooling/Syntax/Tree.h
12

As discussed offline:

  • I don't think (at this point) we need an umbrella header. Generic tree structure, specific node semantics, and operations like "build a tree" are distinct enough from a user POV that asking them to include headers is fine.
  • We may want an umbrella for the node types, if we end up splitting that, but no need yet.
  • Splitting generic tree stuff vs specific node stuff sounds good, but I think having them be sibling headers in Tooling/Syntax is enough - not sure about the Tree/ subdirectory.

So I'd suggest something like:

  • Tree/Cascade.h + Arena.h --> Tree.h
  • Tree.h -> BuildTree.h
  • Tree/Nodes.h + NodeKind.h --> Nodes.h

(have comments on some of these throughout)

clang/include/clang/Tooling/Syntax/Tree/Cascade.h
2

I like the separation - my concern here was the specific word "Cascade".
I'd suggest "Tree" here as it really does define the structure.
The existing "Tree.h" defines operations, and can be named after them. As discussed, I think the design is clean and doesn't need to be hidden by an umbrella header.

40

maybe add a comment - newly created nodes have no parent until added to one

60

syntax::Leaf vs syntax::Struct seems a little odd - it talks about the tree structure rather than the contents. (Unlike TreeNode/CompositeNode this is likely to be used for its specific semantics).

Maybe syntax::Tokens (though the s is subtle). syntax::Text?

65

you're going to want classof for each node type, a private copy constructor (for cloning), a friend statement to whatever does the cloning (or the clone() function itself, if it goes on the class...)

You may want to put this behind a DEFINE_NODE_BOILERPLATE(Leaf) macro :-(

75

What about Subtree?

clang/include/clang/Tooling/Syntax/Tree/NodeKind.h
2

Why is this separate from Nodes.h?

clang/include/clang/Tooling/Syntax/Tree/NodeList.h
2

Implementing custom containers is a bit sad.

Alternatives we discussed:

  • adapt bumpptrallocactor to std allocator and use std::vector: wastes a pointer from vector->allocator
  • use ArrayRef<Node*> or ArrayRef<Node> and require the whole list to be reallocated when children change (but *replacing* children is fine)
  • use a linked-list representation instead: Node* Node::NextSibling, CompositeNode::FirstChild. This fits the allocation strategy more nicely. You probably need only single links, and can add an iterator if needed.
sammccall added inline comments.May 21 2019, 7:24 AM
clang/include/clang/Tooling/Syntax/Tree/NodeKind.h
23

if you make this operator<<, then it's slightly more flexible I think (llvm::to_string also works).
It's not as fast, but I don't think that matters here?

clang/lib/Tooling/Syntax/BuildFromAST.cpp
2

I haven't reviewed this file yet :-)

ilya-biryukov added inline comments.May 21 2019, 7:46 AM
clang/include/clang/Tooling/Syntax/Corpus.h
38 ↗(On Diff #198427)

The only use-case in my prototype so far is creating token nodes for punctuation nodes, e.g. say you want to create an expr of the form <a>+<b>, where both a and b are existing expressions and you need to synthesize a leaf node for +.
We use this function to synthesize a buffer with the corresponding token.

All the use-cases I can imagine are around synthesizing syntax trees (as opposed to constructing them from the AST).

ilya-biryukov marked 24 inline comments as done.
  • Address comments
Herald added a project: Restricted Project. · View Herald TranscriptJun 3 2019, 9:57 AM

I've addressed most of the comments, except the naming ones.
We need a convention for naming the language nodes and names for composite and leaf structural nodes.

For "language" nodes, I suggest we use CompoundStatement, Recovery, TopLevelDeclaration, TemplateArgumentList, TypeTemplateArgument, etc. That is, we spell out the words in full, no shorthands like Stmt or Expr. That would make things a bit more verbose, but hopefully that helps distinguish from clang AST.

For structural nodes, see the relevant comment threads.

clang/include/clang/Tooling/Syntax/Arena.h
2

Now part of Tree.h

clang/include/clang/Tooling/Syntax/Corpus.h
38 ↗(On Diff #198427)

Renamed to lexBuffer. This does not have usages (yet), we can also remove it from this patch if needed.

40 ↗(On Diff #198427)

Removed in favour of placement new. I guess that also makes it a bit more natural to have separate storage for other nodes that have different lifetime (e.g. use a separate arena).

clang/include/clang/Tooling/Syntax/Tree.h
12

Changed to the proposed header structure, it looks good.
Had to move Leaf::classof and TreeNode::classof into a .cpp, they need a definition of NodeKind. Keeping those in a header file was a micro-optimization anyway, that's probably not too important.

clang/include/clang/Tooling/Syntax/Tree/Cascade.h
60

syntax::Tokens actually looks good, but we should rename the tokens() accessor somehow in that case.
I have only super-generic variants in my head: elements(), items(). Any better ideas?

65

I'd avoid using macros. As long as the amount of boilerplate is small, it's not too annoying to write.

And it is small for now, we only have a single constructor and a classof per node, so keeping it explicit in this patch seems ok.

We can revisit if more stuff pops up, but I think we do it without extra boilerplate per node for clone, and hopefully for other things too.

75

Subtree seems ok, although Composite conveys the meaning better to my taste.

Composite does not seem to work without the node suffix, though, and we probably don't want the suffix in other nodes, so I'm torn on this.

clang/include/clang/Tooling/Syntax/Tree/NodeKind.h
2

Not anymore. The original reason was that Tree.h need NodeKind for implementing casts and the implementation was in a header.

23

For some reason I thought gdb does not show the enum value names, so I added a named method to simplify debugging.

Turns out it does show the enum names, not just int values, so I'm perfectly happy to the stream output operator.

clang/include/clang/Tooling/Syntax/Tree/NodeList.h
2

No more custom containers, explicit tree structure seems to be better.

clang/lib/Tooling/Syntax/BuildFromAST.cpp
2

Please do!

ilya-biryukov added inline comments.Jun 3 2019, 10:02 AM
clang/include/clang/Tooling/Syntax/BuildTree.h
11 ↗(On Diff #202745)

This needs an update, will do in the next round!

I've addressed most of the comments, except the naming ones.
We need a convention for naming the language nodes and names for composite and leaf structural nodes.

For "language" nodes, I suggest we use CompoundStatement, Recovery, TopLevelDeclaration, TemplateArgumentList, TypeTemplateArgument, etc. That is, we spell out the words in full, no shorthands like Stmt or Expr. That would make things a bit more verbose, but hopefully that helps distinguish from clang AST.

SGTM.

clang/include/clang/Tooling/Syntax/Tree.h
105

as discussed offline, having a leaf node store a range of tokens (rather than just one) looks attractive now, but as the syntax tree gets more detailed there are relatively few cases where multiple consecutive tokens should really be undifferentiated siblings.

Might be better to bite the bullet now and make leaf hold a single token, so our consecutive token ranges become a linked list. This will also flush out accidental assumptions that only tokens in the same Leaf are adjacent.

Given this, I'm not sure there's a better name than syntax::Leaf. You might consider making it a pointer-like object dereferenceable to Token&.

112

As discussed, I think syntax::Tree might actually be a better name here.

"A Node is either a Tree or a Leaf" is a bit weird, but not too hard to remember I think.

130

(curious: why prepend rather than append?)

clang/lib/Tooling/Syntax/BuildTree.cpp
26 ↗(On Diff #202745)

I find "currently processed" a bit vague. "Pending"?

37 ↗(On Diff #202745)

maybe formNode, formToken, formRoot()?

37 ↗(On Diff #202745)

if syntax nodes strictly nest and we form left-to-right and bottom up, then why are there ever pending nodes that aren't in the range? Is it because we don't aggregate them as early as possible?

(edit: after offline discussion, there are precise invariants here that could be documented and asserted)

41 ↗(On Diff #202745)

Particularly in view of having tokens be 1:1 with Leaf, *constructing* the token nodes as part of higher level constructs / as part of recovery seems a little odd.

What if we constructed all the leaf nodes up front, forming a linked list:
int -> a -> = -> 2 -> + -> 2 -> ; -> eof

When you form a node that covers a range, you splice out the nodes in that range, replacing with the new node:

int -> a -> = -> (2 + 2) -> ; -> eof
(int a = (2 + 2)) -> ; -> eof
etc

Then the invariant is you have a forest, the roots form a linked list, and the trees' respective leaves are a left-to-right partition of the input tokens.

I think this would mean:

  • no separate vector<RangedNode> data structure (AFAICS we can reuse Node)
  • don't have the requirement that the formed node must claim a suffix of the pending nodes, which simplifies the recursive AST visitior

We lose the binary search, but I think tracking the last accessed root (current position) and linear searching left/right will be at least as good in practice, because tree traversal is fundamentally pretty local.

48 ↗(On Diff #202745)

any particular reason learnRoot() and root() are different functions?

If learnRoot() returned TranslationUnit*, then we avoid the need for the caller to know about the dependency, it would track the state itself.

55 ↗(On Diff #202745)

or NodeForRange

92 ↗(On Diff #202745)

So explicitly claiming the primitive tokens, but implicitly claiming the subtrees, seems like a weird mix.
Having both explicit might be nicer:

  • It seems somewhat likely we want to record/tag their semantics (maybe in the child itself, or even the low bits of its pointer?), rather than having accessors scan around looking for something likely.
  • currently when expected subtrees fail to parse, their tokens get (implicitly) wrapped up in Recovery nodes. They're good targets for heuristic parsing, but this probably means we should record what an unexplained range of tokens is supposed to be for.

Thinking of something like:

builder.expectToken(l_brace, S->getLBracLoc());
builder.expectTree(Statement, S->getBody());
builder.expectToken(r_brace, S->getRBracLoc());

where the builder would react to non-null AST nodes by mapping the associated syntax node, and null AST nodes by trying to heuristically parse the tokens in between LBracLoc and RBracLoc.

But lots of unanswered questions here: body is a list of statements, how does that work? What if LBracLoc or RBracLoc is missing? etc.

93 ↗(On Diff #202745)

btw what if LBracLoc or RBracLoc are invalid here due to parser recovery?

135 ↗(On Diff #202745)

this function needs some high-level implementation comments

164 ↗(On Diff #202745)

why can It not point to a node that spans/is past End?

(edit after offline discussion: it's an important invariant that we're always consuming a suffix of the pending nodes)

226 ↗(On Diff #202745)

or It = bsearch(Tokens, [&](const Syntax::Token& L) { return !SM.isBeforeInTranslationUnit(L.location(), TokLoc); })

ilya-biryukov marked 7 inline comments as done.
  • A leaf node stores a single token
  • Restructure code to avoid special-casing leaf nodes

This is not 100% ready yet, but wanted to send it out anyway, as I'll be on vacation until Tuesday.

I've addressed most major comments. In particular, TreeBuilder now looks simpler (and more structured) to my taste.
One thing that's missing is adding children in arbitrary order. It won't be too complicated (would require some thought on how to properly create recovery nodes, though). I'd be tempted to land the current implementation as is and allow adding children in arbitrary order in a separate change (alongside more types of nodes), but let me know what you think.

clang/include/clang/Tooling/Syntax/Tree.h
105

Leaf now stores a single token, that actually simplifies things quite a bit, thanks!
I'd avoid making it a pointer-like object, given that nodes are often passed as pointers on their own. Making them a pointer-like object would mean we can get code that does double deferences (Tok = **Leaf).

130

Appending to a linked list is O(n). If we reverse it, traversing left-to-right order is O(n).

130

Append is O(n) in the current representation as it requires walking to the tail of the list.

clang/lib/Tooling/Syntax/BuildTree.cpp
41 ↗(On Diff #202745)

I went with a slightly different approach, similar to how parser does it. Please let me know what you think.
The new Forest helper struct ensures the tree structure invariants (all tokens must be covered, nodes must nest properly based on a syntax structure), the rest of the code in tree builder takes care of folding the nodes in a proper order and properly advancing the token stream (it's somewhat similar in the details of how parsers are implemented, except that instead of parsing we actually walk a pre-parsed AST).

It still needs some comments, but I think its intentions should be clear.
Let me know what you think, happy to discuss this offline too.

48 ↗(On Diff #202745)

learnRoot is called inside ast visitor when processing TranslationUnitDecl and root() is used to consume the result.

I guess we could just delay learnRoot until consume() is called, shouldn't be a big deal. I'll do this in the next iteration.

93 ↗(On Diff #202745)

This will currently break and we should definitely fix this

93 ↗(On Diff #202745)

We don't recover from errors properly here, I'd add a FIXME and figure this out later. Does that SG?

The general strategy I would propose is to just skip the tokens (we will return null from the corresponding accessors, etc) and create recovery nodes (RecoveryExpression, etc.) for composite nodes.

  • Do the renames
  • A few more renames and docs
  • Cleanups and comments
  • Reformat the code
ilya-biryukov marked 13 inline comments as done.Jun 21 2019, 5:07 AM

This is ready for another round now.

clang/lib/Tooling/Syntax/BuildTree.cpp
37 ↗(On Diff #202745)

Sorry, missed this comment and went with expectToken() and expectNode(), root is now built on consume(). Can change to form*, not a big deal.

I still need to write good docs about the invariants here, so leaving this open.

48 ↗(On Diff #202745)

consume() now builds the root node.

92 ↗(On Diff #202745)

That's exactly the design we have one, with a limitation that expect* have to be called in left-to-right and bottom-up manner. Also, you can only build a tree that ends at the tokens that were consumed.

This is actually a reasonable interface to build a tree from an actual parser, but might feel weird for a ast-to-syntax transformation. I need to figure out a way to write good docs about it, but there's a separate comment for that. Marking this as done (although the questions you mentioned at the end are still there)

135 ↗(On Diff #202745)

Done. Also renamed it to consumeNode.

The docs are very short, though, might need a revamp for clarity.

164 ↗(On Diff #202745)

This is now spelled out in the documentation for foldChildren.

ilya-biryukov marked 4 inline comments as done.Jun 21 2019, 5:08 AM

Although there are still rough edges, I believe the storage model is agreed upon and we can hopefully address the rest in the follow-ups.

  • Introduce roles to allow distinguishing the child nodes.
  • Remove recovery node, use an unknown role instead.
  • TreeBuidler now can consume children at any point, not just suffix nodes.
  • Remove (outdated) changes to gn files

This is now in a pretty good shape, I've incorporated changes after our offline discussions about child roles.
The builder interface is also much richer now, removing a requirement that the tree has to be traversed left-to-right (bottom-up is still required!).

sammccall accepted this revision.Jul 1 2019, 7:21 AM

Nice, let's land this!

clang/include/clang/Tooling/Syntax/Nodes.h
35 ↗(On Diff #206715)

I don't think TU is actually a declaration. Is there a reason to consider it one from a syntax POV?

41 ↗(On Diff #206715)

I have a slight feeling the EOF token is going to be annoying, e.g. can't just splice stuff in at the end of the list. But not sure if it'll be a big deal, and whether the alternatives are better.

43 ↗(On Diff #206715)

we discussed offline - with 256 values, is it possible to come up with a single role enum that would cover all node types?

Advantage would be that certain logic could be generic (e.g. Recovery could be a role for leaves under any Tree, LParen/RParen/MainKeyword could apply to if, while, switch...)

This revision is now accepted and ready to land.Jul 1 2019, 7:21 AM
ilya-biryukov marked 4 inline comments as done.
  • s/TranslationUnitDeclaration/TranslationUnit
  • Remove accessor from 'eof', add a FIXME to remove it from the tree altogether
ilya-biryukov added inline comments.Jul 8 2019, 10:24 AM
clang/include/clang/Tooling/Syntax/Nodes.h
35 ↗(On Diff #206715)

It's a declaration in a sense that it has a corresponding instance of clang::Decl that it "introduces", i.e. the clang::TranslationUnitDecl.

But you are right, TranslationUnit is a better name: this aligns with the C++ grammar from the standard (translation-unit), it does lack similarity with other declarations from the standard (and from clang).

Renamed to TranslationUnit

41 ↗(On Diff #206715)

That's a good point, I actually don't see how eof token in a tree would be useful (it's probably fine to have in the TokenBuffer, though, allows). Moreover, it could cause confusion and bugs when working with multiple translation units (I imagine moving nodes between two different TUs and ending up with multiple eofs somewhere)

I've removed the corresponding accessor from the TranslationUnit node and added a FIXME to remove it from the tree altogether.

43 ↗(On Diff #206715)

Will need to do some estimations to answer this properly, but my gut feeling is that 256 could end up being too limiting in the long run (I would expect each node to have at least one child, so without deduplication we can at least as many roles as we have kinds).

Could imagine a two-level numbering scheme, though:

  • some generic roles like lparen, rparen, etc, take first N roles.
  • higher numbers are for node-specific roles (e.g. LHS or RHS of a BinaryExpr).

But at that point, we probably don't have the benefits of a single enum.

This revision was automatically updated to reflect the committed changes.

@ilya-biryukov We're seeing buildbot failures in SyntaxTests.exe :
http://lab.llvm.org:8011/builders/llvm-clang-lld-x86_64-scei-ps4-ubuntu-fast/builds/50927
http://lab.llvm.org:8011/builders/llvm-clang-lld-x86_64-scei-ps4-windows10pro-fast/builds/26822

Failing Tests (1):

Clang-Unit :: Tooling/Syntax/./SyntaxTests.exe/SyntaxTreeTest.Basic

Sorry about that. That's the same error we had with the previous patch. Will send a fix right away.

RKSimon reopened this revision.Jul 9 2019, 4:30 AM

@ilya-biryukov I'm sorry but I've reverted this at rL365465

This revision is now accepted and ready to land.Jul 9 2019, 4:30 AM
RKSimon requested changes to this revision.Jul 9 2019, 4:30 AM
This revision now requires changes to proceed.Jul 9 2019, 4:30 AM

Relanded in rL365466 with a fix to the crash.

ilya-biryukov abandoned this revision.Jul 9 2019, 4:35 AM
sammccall added inline comments.Jul 9 2019, 6:43 AM
clang/include/clang/Tooling/Syntax/Nodes.h
43 ↗(On Diff #206715)

I think we misunderstood each other here... I think this is fairly important, and that we'd agreed on it in offline discussion. Didn't mean to leave it as an optional comment.

I'd be very surprised if 256 were too limiting. Indeed most nodes will have children, but most of them will not have unique roles. (And I would be surprised if we have 200 node types, but maybe not that surprised...).

If there's a more fundamental objection to merging these, I'd like to find some agreement before going further.