Page MenuHomePhabricator

[Syntax] Add nodes for most common statements

Authored by ilya-biryukov on Jun 26 2019, 11:47 AM.



Most of the statements mirror the ones provided by clang AST.
Major differences are:

  • expressions are wrapped into 'ExpressionStatement' instead of being a subclass of statement,
  • semicolons are always consumed by the leaf expressions (return, expression satement, etc),
  • some clang statements are not handled yet, we wrap those into an UnknownStatement class, which is not present in clang.

We also define an 'Expression' and 'UnknownExpression' classes in order
to produce 'ExpressionStatement' where needed. The actual implementation
of expressions is not yet ready, it will follow later.

Event Timeline

ilya-biryukov created this revision.Jun 26 2019, 11:47 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 26 2019, 11:47 AM

This change mostly aims to illustrate that TreeBuilder seems to be powerful enough to go beyond basic nodes.
But it also introduces enough nodes to make the syntax trees minimally useful for traversing statement nodes. Hopefully that could become a good basis to define other APIs (mutations, etc).

sammccall added inline comments.Jul 8 2019, 10:46 AM

there are going to be many of these. I'd suggest either sorting them all, or breaking them into blocks (e.g. statements vs declarations vs leaf/tu/etc) and sorting those blocks.


Do you want to expose the statement-ending-semicolon here?

(Not all statements have it, but common enough you may want it in the base class instead of all children)


The fact that this can be an arbitrary statement is kind of shocking. But apparently true!

In the long run, we're probably going to be able to find the case statements somehow, even though they're not part of the immediate grammar. Not sure whether this should be via the regular AST or by adding links here. Anyway, problem for another day.


expression for the value?


syntactically, is it useful to model the body as a single statement? It's not a CompoundStmt as it has no braces. Seems like a sequence...

Or is the idea that the first following statement is the body (might be nothing), and subsequent ones aren't part of the body? Why is this more useful than making the body a sibling?


might be handy to unify this with CaseStatement somehow (a base class, or make it literally a CaseStatement with a null body and a bool isDefaultCase() that looks at the keyword token)

Mostly thinking about code that's going to iterate over case statements.


I guess the missing cond here (and similar below) are due to complexities around the variable declaring variants?

Warrants a FIXME I think


I think throughout it's important to mark which of these are:

  • nullable in correct code
  • nullable in code generated by recovery

(any reason we can't already have the expr here?)

ilya-biryukov marked 2 inline comments as done.Jul 9 2019, 8:00 AM

Submitting a few comments to start up the discussions.

The actual changes will follow.


Yes, only "leaf" (i.e. the ones not having any statement children) have it.
I was thinking about:

  • having a separate class for non-composite statements and providing an accessor there,
  • providing an accessor in each of the leaf statements (would mean some duplication, but, arguably, better discoverability).

But, from an offline conversation, we seem to disagree that inheritance is a proper way to model this.
Would it be ok to do this in a follow-up? I'll add a FIXME for now.


This models the structure of the C++ grammar (and clang AST).
Getting from a switch statements to all its case and default labels seems useful, but should be addressed by a separate API that traverses the corresponding syntax tree nodes.

Marking as done, from an offline conversation we seem to agree here.
Feel free to reopen if needed.


I would model with with a base class, but let's agree whether that's the right way to approach this.


I would suggest to only mark the nodes that are nullable in the correct code. For recovery, I would assume the following rule (please tell me if I'm wrong):

On a construct whose parsing involved recovery:

  • if the node has an introducing token (if, try, etc.), the corresponding child cannot be null.
  • any other child can be null.
ilya-biryukov marked 5 inline comments as done.
  • Rebase
  • Address comments
  • Restructure the roles
  • Remove the role from tree dumps for now With too many roles it is annoying to update the test outputs on incremental changes. I tried using the symbolic role names there, but they end up being too verbose.
ilya-biryukov added inline comments.Jul 10 2019, 6:29 AM

Yes. Added a FIXME


Added a getter for it.

  • Mark groups of kinds for statements and expressions

This is ready for another round


I've added two blocks now - statements and expressions.
Did not sort, though, I find the semantic grouping (loop statements close to each other) more useful, but hard to keep consistent.

sammccall accepted this revision.Aug 5 2019, 4:20 AM
sammccall added inline comments.

Can you add a comment here saying the ordering/blocks must correspond to the Node inheritance hierarchy? This is *kind of* common knowledge, but I think this is normally handled by tablegen.


As discussed offline, there's some options about how abstract/concrete these roles should be.

e.g. for a list of function args, this could be FunctionOpenParen/FunctionArgExpr/FunctionArgComma/FunctionCloseParam (specific) <-> OpenParen/Arg/Comma/CloseParen <-> Open/Item/Separator/Close.

The more specific ones are somewhat redundant with the parent/child type (but easy to assign systematically), and the more generic ones are more orthogonal (but require more design and may by hard to always make consistent).

The concrete advantage of the generic roles is being able to write code like getTrailingSemicolon(Tree*) or findLoopBreak(Stmt*) or removeListItem(Tree*, int) in a fairly generic way, without resorting to adding a Loop base class or handling each case with separate code.

This is up to you, though.


First: yes, let's not do this now.

Second: I'm wary of using standard is-a inheritance to model more than alternation in the grammar. That is, ForStatement is-a Statement is fine, ForStatement is-a LoopyStatement is suspect. This is to do with the fact that LoopyStatement is-a Statement seems obvious, and we may end up with diamond-shaped inheritance and some conceptual confusion.

This goes for all traits that aren't natural tree-shaped inheritance: HasTrailingSemicolon, LoopyStatement, ...

I think there are two concerns here:

  • we want to be able to get the trailing-semicolon if it exists
  • we want to be able to check if the trailing-semicolon is *expected* including via its static type

One way to do this (not the only one...):

// generic helper, or callers could even write this directly
Optional<Leaf> trailingSemi(Tree *Node) {
  return firstElement(Node->Children<Leaf>(NodeRole::TrailingSemi));

// mixin for trailing semi support. Note: doesn't inherit Statement!
// maybe need/want some CRTP magic
class TrailingSemicolon {
  Optional<Leaf> trailingSemi() const { return trailingSemi((const Node*)this; }

// Gets the trailingSemi() accessor.
ExprStmt : public Statement, TrailingSemicolon { ... }

Agree with this strategy, and the fact that it doesn't need to be documented on every node/occurrence.

But it should definitely be documented somewhere at a high level! (With clang AST, this sort of thing feels like tribal knowledge)


nullable, marked somehow

Optional<Expression*> is tempting as a systematic and hard-to-ignore way of documenting that.
And it reflects the fact that there are three conceptual states for children: present, legally missing, brokenly missing.

At the same time, I'm not sure how to feel about the fact that in practice this can't be present but null, and the fact that *other* non-optional pointers can be null.


since Expr : Stmt, we need to be a bit wary of overloading based on static type.

It's tempting to say it's correct here: if we statically know E is an Expr, then maybe it's never correct to consume the semicolon. But is the converse true? e.g. if we're traversing using RAV and call getRange() in visitstmt...

(The alternatives seem to be a) removing the expr version of the function, and having the stmt version take a bool ConsumeSemi or b) change the stmt version to have (dynamic) expr behave like the expr overload, and handle it specially when forming exprstmt. More verbose, genuinely conflicted here)


maybe group with corresponding WalkUpFromCXXForRangeStmt?
(Could also group all Traverse* together if you prefer. Current ordering seems a little random)


nit: RAV?

This revision is now accepted and ready to land.Aug 5 2019, 4:20 AM
ilya-biryukov marked 20 inline comments as done.
  • Group Traverse* and Walk* together
  • s/RAT/RAV
  • Add a comment about nullability of the accessors
  • Name function for consuming statements and expressions differently

I definitely agree that writing generic functions is simpler with the proposed approach.

However, I am aiming for safer APIs here, albeit less generic. E.g. we'll have something like
removeFunctionArgument(ArgumentList*, int) and removeInitializer(InitializerList*, int)
rather than removeListItem(Tree*, int) in the public API.

Reasons are discoverability of the operations for particular node types.

Generic functions might still make sense as an implementation detail to share the code.
I'll keep as is for now, but will keep the suggestion in mind.


Added a corresponding comment to the file header.


Having Optional<Expression*> models the problem space better, but is much harder to use on the client side.
I'd keep as is, the file comment explains that one should assume all accessors can return null.

Update the comment here to indicate both return; and return <expr>; are represented by this node.


Using two functions with different names now.


Went for grouping Traverse* and Walk* together.
Normally would also choose to put related methods (i.e. WalkUpFromX and TraverseX) together, but they have totally different meaning here: TraverseX creates a workaround for suboptimal RAV traversals and WalkUpFromX actually builds the syntax tree.


Right, thanks. RAT was funnier, though. Even if incorrect...

This revision was automatically updated to reflect the committed changes.

Build result: fail - 59843 tests passed, 21 failed and 768 were skipped.

failed: lld.ELF/linkerscript/filename-spec.s
failed: Clang.Index/index-module-with-vfs.m
failed: Clang.Modules/double-quotes.m
failed: Clang.Modules/framework-public-includes-private.m
failed: Clang.VFS/external-names.c
failed: Clang.VFS/framework-import.m
failed: Clang.VFS/implicit-include.c
failed: Clang.VFS/include-mixed-real-and-virtual.c
failed: Clang.VFS/include-real-from-virtual.c
failed: Clang.VFS/include-virtual-from-real.c
failed: Clang.VFS/include.c
failed: Clang.VFS/incomplete-umbrella.m
failed: Clang.VFS/module-import.m
failed: Clang.VFS/module_missing_vfs.m
failed: Clang.VFS/real-path-found-first.m
failed: Clang.VFS/relative-path.c
failed: Clang.VFS/test_nonmodular.c
failed: Clang.VFS/umbrella-framework-import-skipnonexist.m
failed: Clang.VFS/vfsroot-include.c
failed: Clang.VFS/vfsroot-module.m
failed: Clang.VFS/vfsroot-with-overlay.c

Log files: console-log.txt, CMakeCache.txt