This is an archive of the discontinued LLVM Phabricator instance.

clang/include/clang/Tooling/Syntax/Corpus.h
23	I think plain SyntaxArena might be a better name here :-/ Corpus refers to texts (the use in dex is by analogy, as we call symbols "documents" from search).
26	MainFile is presumably the whole TU, name might need a tweak. Can it be empty? The relationship between Corpus and TokenBuffer seems a little weird. Why is it needed?
38	Are you planning to have a way to add tokens directly? Having to turn them into text and re-lex them seems like it might be inconvenient.
40	Now there's two ways to do this: `new (C.allocator()) T(...)` or `C.construct<T>(...)`. Do we need both? (If we do, is the syntax `new (C) T(...)` more natural?)
clang/include/clang/Tooling/Syntax/Tree.h
34	for a translation unit? or for only decls within the main file?
40	I've been burned with adding these APIs without use cases. It seems likely you want a way to:
- skip traversal of children
- abort the traversal entirely
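A minimal sketch of the two traversal controls requested in the comment on line 40. The `Node` type, the `Visit` enum and this `traverse` signature are hypothetical stand-ins for illustration, not the API from the patch:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a syntax tree node; not the type from this patch.
struct Node {
  std::string Name;
  std::vector<Node> Children;
};

// What the visitor asks the traversal to do after seeing a node.
enum class Visit { Continue, SkipChildren, Abort };

// Pre-order traversal. Returns false if the visitor aborted the walk.
inline bool traverse(const Node &N,
                     const std::function<Visit(const Node &)> &Visitor) {
  assert(Visitor && "visitor must be callable");
  switch (Visitor(N)) {
  case Visit::Abort:
    return false; // stop everything, including pending siblings
  case Visit::SkipChildren:
    return true; // keep walking, but do not descend into N
  case Visit::Continue:
    break;
  }
  for (const Node &Child : N.Children)
    if (!traverse(Child, Visitor))
      return false;
  return true;
}
```

Returning a three-state value from the visitor keeps both controls in one place; a boolean return could express only one of them.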
clang/include/clang/Tooling/Syntax/Tree/Cascade.h
1	this is Cascade.h, not tree.h
1	why "cascade"?
1	The Tree/ subdirectory seems superfluous - why are these separate from Syntax/?
74	This use of "tree node" to mean specifically internal node seems confusing - is it common?

ilya-biryukov added inline comments.May 7 2019, 10:20 AM

clang/include/clang/Tooling/Syntax/Tree.h
40	Not having an option to abort traversal protects us against timing attacks... Agree with both, will address in this patch.

Make traverse() internal to its only use-site.
s/Corpus/Arena.
Address some other comments.

clang/include/clang/Tooling/Syntax/Corpus.h
23	Went with Arena
26	Operations on the trees sometimes need to know something about the underlying tokens - they have access to the `TokenBuffer` that produced them. More specifically, this can be used to map between spelled and expanded tokens and check the mappings are possible.
38	The tokens have source locations and refer to a text in some buffer. `tokenizeBuffer` makes it easier, not harder, to mock tokens.
40	I think `C.construct<T>()` reads better than `new (C) T(...)`. Not a big fan of placement new exprs.
clang/include/clang/Tooling/Syntax/Tree.h
34	For a translation unit. We will add versions that build for a subtree of the AST later.
40	Removed from the public API, we seem to have different ideas on what it should look like and I'd prefer to focus on the storage model in this patch.
clang/include/clang/Tooling/Syntax/Tree/Cascade.h
1	Cascade defines a few base nodes: a composite node (`TreeNode`) and a leaf node that holds tokens. I'd really like to isolate them from language-specific nodes, so language-specific nodes live in a separate file (`Nodes.h`). However, they need to see the definition of a composite node, hence the split. Users are advised to use an umbrella header, `Tree.h`. The extra directory is to minimize the number of headers in the top-level directory, having too many is confusing.
74	I don't think it's common, can use `CompositeNode` - seems like a better alternative

ilya-biryukov edited the summary of this revision. May 8 2019, 2:33 AM

Harbormaster completed remote builds in B31588: Diff 198606.May 8 2019, 2:35 AM

s/corpus/arena
Remove an accidental cmake change

Harbormaster completed remote builds in B31591: Diff 198609.May 8 2019, 2:41 AM

Definitely like the choice of CompositeNode owning the concrete storage!

clang/include/clang/Tooling/Syntax/Arena.h
1 ↗	(On Diff #198609)	From a user's point of view, this looks a lot like part of the tree structure in some sense. If you expect that users will need to keep the arena rather than the TokenBuffer around (e.g. so nodes can be allocated), then it might make sense to declare it at the bottom of `Cascade.h`
clang/include/clang/Tooling/Syntax/Corpus.h
38	Fair enough. `lexBuffer` might be a slightly clearer name? Who are the intended non-test users of this function? Are they better served by being able (and responsible) for constructing a MemoryBuffer with e.g. a sensible name and ownership, or would it be better to pass a StringRef and have the Arena come up with a sensible "anonymous" name?
40	They are fairly consistently used in llvm/clang for this sort of thing, though. I find it valuable because arena allocations look otherwise a lot like buggy ownership patterns. The dedicated syntax calls out the unusual case: we're creating a new thing, and someone else owns it, but won't do anything with it.
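To make the two spellings from this thread concrete, here is a small sketch in standard C++. The toy `Arena` class merely stands in for `llvm::BumpPtrAllocator`, and `Leaf` and the free function `construct` are hypothetical names chosen for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <type_traits>
#include <utility>
#include <vector>

// Toy bump allocator standing in for llvm::BumpPtrAllocator (an assumption
// made to keep the sketch self-contained). Memory is freed only when the
// arena itself is destroyed.
class Arena {
public:
  void *allocate(std::size_t Size, std::size_t Align) {
    std::size_t Start = (Used + Align - 1) / Align * Align;
    if (Slabs.empty() || Start + Size > SlabSize) {
      assert(Size <= SlabSize && "object too large for this toy arena");
      Slabs.emplace_back(new unsigned char[SlabSize]);
      Start = 0;
    }
    Used = Start + Size;
    return Slabs.back().get() + Start;
  }

private:
  static constexpr std::size_t SlabSize = 4096;
  std::vector<std::unique_ptr<unsigned char[]>> Slabs;
  std::size_t Used = 0;
};

// Placement form that enables the `new (A) T(...)` spelling.
inline void *operator new(std::size_t Size, Arena &A) {
  return A.allocate(Size, alignof(std::max_align_t));
}
// Matching placement delete; only called if T's constructor throws.
inline void operator delete(void *, Arena &) noexcept {}

// The helper-function spelling, with the same trivially-destructible check
// as `construct` in the diff (destructors are never run by the arena).
template <class T, class... Args> T *construct(Arena &A, Args &&...As) {
  static_assert(std::is_trivially_destructible<T>::value,
                "nodes should be trivially destructible");
  return new (A) T(std::forward<Args>(As)...);
}

// Hypothetical node type used only for this example.
struct Leaf {
  explicit Leaf(int TokenIndex) : TokenIndex(TokenIndex) {}
  int TokenIndex;
};
```

Both `new (A) Leaf(1)` and `construct<Leaf>(A, 2)` end up in the same allocation path; the difference under discussion is only how visibly the call site signals that the arena, not the caller, owns the result.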
clang/include/clang/Tooling/Syntax/Tree.h
12	As discussed offline:
- I don't think (at this point) we need an umbrella header. Generic tree structure, specific node semantics, and operations like "build a tree" are distinct enough from a user POV that asking them to include headers is fine. We may want an umbrella for the node types, if we end up splitting that, but no need yet.
- Splitting generic tree stuff vs specific node stuff sounds good, but I think having them be sibling headers in `Tooling/Syntax` is enough - not sure about the `Tree/` subdirectory.
So I'd suggest something like:
- `Tree/Cascade.h` + `Arena.h` --> `Tree.h`
- `Tree.h` --> `BuildTree.h`
- `Tree/Nodes.h` + `NodeKind.h` --> `Nodes.h`
(have comments on some of these throughout)
clang/include/clang/Tooling/Syntax/Tree/Cascade.h
1	I like the separation - my concern here was the specific word "Cascade". I'd suggest "Tree" here as it really does define the structure. The existing "Tree.h" defines operations, and can be named after them. As discussed, I think the design is clean and doesn't need to be hidden by an umbrella header.
40	maybe add a comment - newly created nodes have no parent until added to one
60	`syntax::Leaf` vs `syntax::Struct` seems a little odd - it talks about the tree structure rather than the contents. (Unlike TreeNode/CompositeNode this is likely to be used for its specific semantics). Maybe `syntax::Tokens` (though the s is subtle). `syntax::Text`?
65	you're going to want classof for each node type, a private copy constructor (for cloning), a friend statement to whatever does the cloning (or the clone() function itself, if it goes on the class...) You may want to put this behind a DEFINE_NODE_BOILERPLATE(Leaf) macro :-(
74	What about Subtree?
clang/include/clang/Tooling/Syntax/Tree/NodeKind.h
2	Why is this separate from Nodes.h?
clang/include/clang/Tooling/Syntax/Tree/NodeList.h
2	Implementing custom containers is a bit sad. Alternatives we discussed:
- adapt BumpPtrAllocator to a std allocator and use std::vector: wastes a pointer from vector->allocator
- use ArrayRef<Node*> or ArrayRef<Node> and require the whole list to be reallocated when children change (but replacing children is fine)
- use a linked-list representation instead: Node* Node::NextSibling, CompositeNode::FirstChild. This fits the allocation strategy more nicely. You probably need only single links, and can add an iterator if needed.
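The linked-list alternative can be sketched as follows. This is an illustration under assumptions: a single `Node` struct plays both the leaf and the composite role, and `prependChild` and `childIds` are hypothetical helpers rather than functions from the patch:

```cpp
#include <cassert>
#include <vector>

// One struct for both leaves and composites, to keep the sketch short.
// A freshly created node has no parent until it is added to one.
struct Node {
  int Id = 0;
  Node *Parent = nullptr;
  Node *NextSibling = nullptr;
  Node *FirstChild = nullptr; // stays null for leaves
};

// O(1): link Child in front of Parent's existing children.
inline void prependChild(Node &Parent, Node &Child) {
  assert(!Child.Parent && !Child.NextSibling && "child is already attached");
  Child.Parent = &Parent;
  Child.NextSibling = Parent.FirstChild;
  Parent.FirstChild = &Child;
}

// Walk the singly linked child list left to right.
inline std::vector<int> childIds(const Node &Parent) {
  std::vector<int> Ids;
  for (const Node *C = Parent.FirstChild; C; C = C->NextSibling)
    Ids.push_back(C->Id);
  return Ids;
}
```

With only a `FirstChild` pointer, adding at the front is constant time, so a builder that wants children in source order would add them right to left. No per-node container is needed, which suits bump allocation.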

sammccall added inline comments.May 21 2019, 7:24 AM

clang/include/clang/Tooling/Syntax/Tree/NodeKind.h
23	if you make this `operator<<`, then it's slightly more flexible I think (llvm::to_string also works). It's not as fast, but I don't think that matters here?
clang/lib/Tooling/Syntax/BuildFromAST.cpp
2	I haven't reviewed this file yet :-)

ilya-biryukov added inline comments.May 21 2019, 7:46 AM

clang/include/clang/Tooling/Syntax/Corpus.h
38	The only use-case in my prototype so far is creating token nodes for punctuation nodes, e.g. say you want to create an expr of the form `<a>+<b>`, where both `a` and `b` are existing expressions and you need to synthesize a leaf node for `+`. We use this function to synthesize a buffer with the corresponding token. All the use-cases I can imagine are around synthesizing syntax trees (as opposed to constructing them from the AST).

Address comments

Herald added a project: Restricted Project. Jun 3 2019, 9:57 AM

Herald added a subscriber: llvm-commits.

I've addressed most of the comments, except the naming ones.
We need a convention for naming the language nodes and names for composite and leaf structural nodes.

For "language" nodes, I suggest we use CompoundStatement, Recovery, TopLevelDeclaration, TemplateArgumentList, TypeTemplateArgument, etc. That is, we spell out the words in full, no shorthands like Stmt or Expr. That would make things a bit more verbose, but hopefully that helps distinguish from clang AST.

For structural nodes, see the relevant comment threads.

clang/include/clang/Tooling/Syntax/Arena.h
1 ↗	(On Diff #198609)	Now part of `Tree.h`
clang/include/clang/Tooling/Syntax/Corpus.h
38	Renamed to `lexBuffer`. This does not have usages (yet), we can also remove it from this patch if needed.
40	Removed in favour of placement new. I guess that also makes it a bit more natural to have separate storage for other nodes that have different lifetime (e.g. use a separate arena).
clang/include/clang/Tooling/Syntax/Tree.h
12	Changed to the proposed header structure, it looks good. Had to move `Leaf::classof` and `TreeNode::classof` into a `.cpp`, they need a definition of `NodeKind`. Keeping those in a header file was a micro-optimization anyway, that's probably not too important.
clang/include/clang/Tooling/Syntax/Tree/Cascade.h
60	`syntax::Tokens` actually looks good, but we should rename the `tokens()` accessor somehow in that case. I have only super-generic variants in my head: `elements()`, `items()`. Any better ideas?
65	I'd avoid using macros. As long as the amount of boilerplate is small, it's not too annoying to write. And it is small for now, we only have a single constructor and a `classof` per node, so keeping it explicit in this patch seems ok. We can revisit if more stuff pops up, but I think we do it without extra boilerplate per node for clone, and hopefully for other things too.
74	`Subtree` seems ok, although `Composite` conveys the meaning better to my taste. `Composite` does not seem to work without the `node` suffix, though, and we probably don't want the suffix in other nodes, so I'm torn on this.
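For reference, the per-node boilerplate discussed under line 65 (one constructor plus a `classof`) looks roughly like this. The kind names and the `dynCast` helper are assumptions for illustration; `dynCast` is a simplified stand-in for `llvm::dyn_cast`, not the real template:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical set of kinds; the real list lives in the patch's node headers.
enum class NodeKind : std::uint8_t {
  Leaf,
  TopLevelDeclaration,
  CompoundStatement
};

class Node {
public:
  explicit Node(NodeKind K) : Kind(K) {}
  NodeKind kind() const { return Kind; }

private:
  NodeKind Kind;
};

class Leaf : public Node {
public:
  Leaf() : Node(NodeKind::Leaf) {}
  static bool classof(const Node *N) { return N->kind() == NodeKind::Leaf; }
};

class CompositeNode : public Node {
public:
  explicit CompositeNode(NodeKind K) : Node(K) {}
  static bool classof(const Node *N) { return N->kind() != NodeKind::Leaf; }
};

// Simplified stand-in for llvm::dyn_cast: checked downcast driven by classof.
template <class To> const To *dynCast(const Node *N) {
  assert(N && "dynCast on a null node");
  return To::classof(N) ? static_cast<const To *>(N) : nullptr;
}
```

At two short members per class, writing this by hand is cheap, which is the argument made above against introducing a macro at this stage.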
clang/include/clang/Tooling/Syntax/Tree/NodeKind.h
2	Not anymore. The original reason was that `Tree.h` needed `NodeKind` for implementing casts and the implementation was in a header.
23	For some reason I thought `gdb` does not show the enum value names, so I added a named method to simplify debugging. Turns out it does show the enum names, not just int values, so I'm perfectly happy to switch to the stream output operator.
clang/include/clang/Tooling/Syntax/Tree/NodeList.h
2	No more custom containers, explicit tree structure seems to be better.
clang/lib/Tooling/Syntax/BuildFromAST.cpp
2	Please do!

Harbormaster completed remote builds in B32829: Diff 202745.Jun 3 2019, 9:59 AM

ilya-biryukov added inline comments.Jun 3 2019, 10:02 AM

clang/include/clang/Tooling/Syntax/BuildTree.h
11 ↗	(On Diff #202745)	This needs an update, will do in the next round!

In D61637#1527727, @ilya-biryukov wrote:

I've addressed most of the comments, except the naming ones.
We need a convention for naming the language nodes and names for composite and leaf structural nodes.

For "language" nodes, I suggest we use CompoundStatement, Recovery, TopLevelDeclaration, TemplateArgumentList, TypeTemplateArgument, etc. That is, we spell out the words in full, no shorthands like Stmt or Expr. That would make things a bit more verbose, but hopefully that helps distinguish from clang AST.

SGTM.

clang/include/clang/Tooling/Syntax/Tree.h
105	as discussed offline, having a leaf node store a range of tokens (rather than just one) looks attractive now, but as the syntax tree gets more detailed there are relatively few cases where multiple consecutive tokens should really be undifferentiated siblings. Might be better to bite the bullet now and make leaf hold a single token, so our consecutive token ranges become a linked list. This will also flush out accidental assumptions that only tokens in the same Leaf are adjacent. Given this, I'm not sure there's a better name than `syntax::Leaf`. You might consider making it a pointer-like object dereferenceable to Token&.
112	As discussed, I think `syntax::Tree` might actually be a better name here. "A Node is either a Tree or a Leaf" is a bit weird, but not too hard to remember I think.
130	(curious: why prepend rather than append?)
clang/lib/Tooling/Syntax/BuildTree.cpp
26 ↗	(On Diff #202745)	I find "currently processed" a bit vague. "Pending"?
37 ↗	(On Diff #202745)	maybe formNode, formToken, formRoot()?
37 ↗	(On Diff #202745)	if syntax nodes strictly nest and we form left-to-right and bottom up, then why are there ever pending nodes that aren't in the range? Is it because we don't aggregate them as early as possible? (edit: after offline discussion, there are precise invariants here that could be documented and asserted)
41 ↗	(On Diff #202745)	Particularly in view of having tokens be 1:1 with Leaf, constructing the token nodes as part of higher level constructs / as part of recovery seems a little odd.
What if we constructed all the leaf nodes up front, forming a linked list:
`int -> a -> = -> 2 -> + -> 2 -> ; -> eof`
When you form a node that covers a range, you splice out the nodes in that range, replacing with the new node:
`int -> a -> = -> (2 + 2) -> ; -> eof`
`(int a = (2 + 2)) -> ; -> eof`
etc.
Then the invariant is you have a forest, the roots form a linked list, and the trees' respective leaves are a left-to-right partition of the input tokens.
I think this would mean:
- no separate vector<RangedNode> data structure (AFAICS we can reuse Node)
- don't have the requirement that the formed node must claim a suffix of the pending nodes, which simplifies the recursive AST visitor
- We lose the binary search, but I think tracking the last accessed root (current position) and linear searching left/right will be at least as good in practice, because tree traversal is fundamentally pretty local.
48 ↗	(On Diff #202745)	any particular reason learnRoot() and root() are different functions? If learnRoot() returned TranslationUnit*, then we avoid the need for the caller to know about the dependency, it would track the state itself.
55 ↗	(On Diff #202745)	or NodeForRange
92 ↗	(On Diff #202745)	So explicitly claiming the primitive tokens, but implicitly claiming the subtrees, seems like a weird mix. Having both explicit might be nicer:
- It seems somewhat likely we want to record/tag their semantics (maybe in the child itself, or even the low bits of its pointer?), rather than having accessors scan around looking for something likely.
- currently when expected subtrees fail to parse, their tokens get (implicitly) wrapped up in Recovery nodes. They're good targets for heuristic parsing, but this probably means we should record what an unexplained range of tokens is supposed to be for.
Thinking of something like:
  builder.expectToken(l_brace, S->getLBracLoc());
  builder.expectTree(Statement, S->getBody());
  builder.expectToken(r_brace, S->getRBracLoc());
where the builder would react to non-null AST nodes by mapping the associated syntax node, and null AST nodes by trying to heuristically parse the tokens in between LBracLoc and RBracLoc.
But lots of unanswered questions here: body is a list of statements, how does that work? What if LBracLoc or RBracLoc is missing? etc.
93 ↗	(On Diff #202745)	btw what if LBracLoc or RBracLoc are invalid here due to parser recovery?
135 ↗	(On Diff #202745)	this function needs some high-level implementation comments
164 ↗	(On Diff #202745)	why can It not point to a node that spans/is past End? (edit after offline discussion: it's an important invariant that we're always consuming a suffix of the pending nodes)
226 ↗	(On Diff #202745)	or It = bsearch(Tokens, [&](const Syntax::Token& L) { return !SM.isBeforeInTranslationUnit(L.location(), TokLoc); })
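The "forest of roots" idea from the comment on line 41 can be sketched as below. This is a simplified illustration, not the `TreeBuilder` from the patch: the roots are held in a `std::vector` rather than a linked list, lookup is a plain linear scan, and `Node`, `Forest`, `fold` and `render` are all hypothetical names:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Node {
  std::string Label;      // token text for leaves, a kind name otherwise
  std::size_t FirstToken; // index of the first token this node covers
  std::size_t TokenCount; // number of tokens this node covers
  std::vector<std::unique_ptr<Node>> Children;
};

// Leaves print as their text, inner nodes as "(child child ...)".
inline std::string render(const Node &N) {
  if (N.Children.empty())
    return N.Label;
  std::string Out = "(";
  for (std::size_t I = 0; I < N.Children.size(); ++I) {
    if (I)
      Out += ' ';
    Out += render(*N.Children[I]);
  }
  return Out + ")";
}

class Forest {
public:
  // Start with one leaf root per token.
  explicit Forest(const std::vector<std::string> &Tokens) {
    for (std::size_t I = 0; I < Tokens.size(); ++I)
      Roots.push_back(std::unique_ptr<Node>(new Node{Tokens[I], I, 1, {}}));
  }

  // Replace the roots covering tokens [Begin, End) with a single new root.
  // The range must begin and end on root boundaries.
  Node *fold(std::size_t Begin, std::size_t End, std::string Kind) {
    std::size_t First = 0;
    while (First < Roots.size() && Roots[First]->FirstToken < Begin)
      ++First;
    assert(First < Roots.size() && Roots[First]->FirstToken == Begin &&
           "range starts inside an existing subtree");
    std::size_t Last = First;
    while (Last < Roots.size() &&
           Roots[Last]->FirstToken + Roots[Last]->TokenCount <= End)
      ++Last;
    assert(Last > First &&
           Roots[Last - 1]->FirstToken + Roots[Last - 1]->TokenCount == End &&
           "range ends inside an existing subtree");
    std::unique_ptr<Node> Parent(
        new Node{std::move(Kind), Begin, End - Begin, {}});
    for (std::size_t I = First; I < Last; ++I)
      Parent->Children.push_back(std::move(Roots[I]));
    Roots.erase(Roots.begin() + First + 1, Roots.begin() + Last);
    Roots[First] = std::move(Parent);
    return Roots[First].get();
  }

  std::string str() const {
    std::string Out;
    for (std::size_t I = 0; I < Roots.size(); ++I) {
      if (I)
        Out += " -> ";
      Out += render(*Roots[I]);
    }
    return Out;
  }

private:
  std::vector<std::unique_ptr<Node>> Roots;
};
```

Folding tokens 3 to 6 and then 0 to 6 of `int a = 2 + 2 ; eof` reproduces the two intermediate states written out in that comment. The assertions encode the invariant that the roots' leaves always form a left-to-right partition of the tokens.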

A leaf node stores a single token
Restructure code to avoid special-casing leaf nodes

Harbormaster completed remote builds in B33331: Diff 204488.Jun 13 2019, 4:56 AM

This is not 100% ready yet, but wanted to send it out anyway, as I'll be on vacation until Tuesday.

I've addressed most major comments. In particular, TreeBuilder now looks simpler (and more structured) to my taste.
One thing that's missing is adding children in arbitrary order. It won't be too complicated (would require some thought on how to properly create recovery nodes, though). I'd be tempted to land the current implementation as is and allow adding children in arbitrary order in a separate change (alongside more types of nodes), but let me know what you think.

clang/include/clang/Tooling/Syntax/Tree.h
105	Leaf now stores a single token, that actually simplifies things quite a bit, thanks! I'd avoid making it a pointer-like object, given that nodes are often passed as pointers on their own. Making them a pointer-like object would mean we can get code that does double dereferences (Tok = **Leaf).
130	Appending to a linked list is `O(n)`. If we reverse it, traversing left-to-right order is `O(n)`.
130	Append is `O(n)` in the current representation as it requires walking to the tail of the list.
clang/lib/Tooling/Syntax/BuildTree.cpp
41 ↗	(On Diff #202745)	I went with a slightly different approach, similar to how parser does it. Please let me know what you think. The new `Forest` helper struct ensures the tree structure invariants (all tokens must be covered, nodes must nest properly based on a syntax structure), the rest of the code in tree builder takes care of folding the nodes in a proper order and properly advancing the token stream (it's somewhat similar in the details of how parsers are implemented, except that instead of parsing we actually walk a pre-parsed AST). It still needs some comments, but I think its intentions should be clear. Let me know what you think, happy to discuss this offline too.
48 ↗	(On Diff #202745)	`learnRoot` is called inside ast visitor when processing `TranslationUnitDecl` and `root()` is used to consume the result. I guess we could just delay `learnRoot` until `consume()` is called, shouldn't be a big deal. I'll do this in the next iteration.
93 ↗	(On Diff #202745)	This will currently break and we should definitely fix this
93 ↗	(On Diff #202745)	We don't recover from errors properly here, I'd add a FIXME and figure this out later. Does that SG? The general strategy I would propose is to just skip the tokens (we will return null from the corresponding accessors, etc) and create recovery nodes (`RecoveryExpression`, etc.) for composite nodes.

Do the renames

Harbormaster completed remote builds in B33621: Diff 205589.Jun 19 2019, 7:32 AM

A few more renames and docs
Cleanups and comments
Reformat the code

Harbormaster completed remote builds in B33725: Diff 205974.Jun 21 2019, 4:58 AM

This is ready for another round now.

clang/lib/Tooling/Syntax/BuildTree.cpp
37 ↗	(On Diff #202745)	Sorry, missed this comment and went with `expectToken()` and `expectNode()`, root is now built on `consume()`. Can change to `form*`, not a big deal. I still need to write good docs about the invariants here, so leaving this open.
48 ↗	(On Diff #202745)	`consume()` now builds the root node.
92 ↗	(On Diff #202745)	That's exactly the design we have now, with a limitation that `expect*` have to be called in a left-to-right and bottom-up manner. Also, you can only build a tree that ends at the tokens that were consumed. This is actually a reasonable interface to build a tree from an actual parser, but might feel weird for an AST-to-syntax transformation. I need to figure out a way to write good docs about it, but there's a separate comment for that. Marking this as done (although the questions you mentioned at the end are still there)
135 ↗	(On Diff #202745)	Done. Also renamed it to `consumeNode`. The docs are very short, though, might need a revamp for clarity.
164 ↗	(On Diff #202745)	This is now spelled out in the documentation for `foldChildren`.

Although there are still rough edges, I believe the storage model is agreed upon and we can hopefully address the rest in the follow-ups.

Introduce roles to allow distinguishing the child nodes.
Remove recovery node, use an unknown role instead.
TreeBuilder can now consume children at any point, not just suffix nodes.

Harbormaster completed remote builds in B33959: Diff 206714.Jun 26 2019, 11:20 AM

Remove (outdated) changes to gn files

Harbormaster completed remote builds in B33960: Diff 206715.Jun 26 2019, 11:22 AM

This is now in a pretty good shape, I've incorporated changes after our offline discussions about child roles.
The builder interface is also much richer now, removing a requirement that the tree has to be traversed left-to-right (bottom-up is still required!).

ilya-biryukov added a child revision: D63835: [Syntax] Add nodes for most common statements.Jun 26 2019, 11:48 AM

Nice, let's land this!

clang/include/clang/Tooling/Syntax/Nodes.h
35 ↗	(On Diff #206715)	I don't think TU is actually a declaration. Is there a reason to consider it one from a syntax POV?
41 ↗	(On Diff #206715)	I have a slight feeling the EOF token is going to be annoying, e.g. can't just splice stuff in at the end of the list. But not sure if it'll be a big deal, and whether the alternatives are better.
43 ↗	(On Diff #206715)	we discussed offline - with 256 values, is it possible to come up with a single role enum that would cover all node types? Advantage would be that certain logic could be generic (e.g. `Recovery` could be a role for leaves under any Tree, `LParen`/`RParen`/`MainKeyword` could apply to if, while, switch...)

This revision is now accepted and ready to land.Jul 1 2019, 7:21 AM

s/TranslationUnitDeclaration/TranslationUnit
Remove accessor from 'eof', add a FIXME to remove it from the tree altogether

Harbormaster completed remote builds in B34515: Diff 208454.Jul 8 2019, 10:23 AM

ilya-biryukov added inline comments.Jul 8 2019, 10:24 AM

clang/include/clang/Tooling/Syntax/Nodes.h
35 ↗	(On Diff #206715)	It's a declaration in a sense that it has a corresponding instance of `clang::Decl` that it "introduces", i.e. the `clang::TranslationUnitDecl`. But you are right, `TranslationUnit` is a better name: this aligns with the C++ grammar from the standard (`translation-unit`), it does lack similarity with other declarations from the standard (and from clang). Renamed to `TranslationUnit`
41 ↗	(On Diff #206715)	That's a good point, I actually don't see how `eof` token in a tree would be useful (it's probably fine to have in the `TokenBuffer`, though, allows). Moreover, it could cause confusion and bugs when working with multiple translation units (I imagine moving nodes between two different TUs and ending up with multiple `eof`s somewhere) I've removed the corresponding accessor from the `TranslationUnit` node and added a FIXME to remove it from the tree altogether.
43 ↗	(On Diff #206715)	Will need to do some estimations to answer this properly, but my gut feeling is that 256 could end up being too limiting in the long run (I would expect each node to have at least one child, so without deduplication we can have at least as many roles as we have kinds). Could imagine a two-level numbering scheme, though:
- some generic roles like `lparen`, `rparen`, etc, take first `N` roles.
- higher numbers are for node-specific roles (e.g. LHS or RHS of a `BinaryExpr`).
But at that point, we probably don't have the benefits of a single enum.
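A rough sketch of what such a two-level scheme could look like in a single byte-sized enum. The role names and the threshold of 64 are made up for illustration; the thread does not settle on either:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical role enum: generic roles occupy the low values, node-specific
// roles start at FirstNodeSpecific (the "N" from the comment above).
enum class NodeRole : std::uint8_t {
  Unknown = 0,
  LeftParen,
  RightParen,
  MainKeyword,
  FirstNodeSpecific = 64, // assumed threshold, not taken from the patch
  BinaryExpr_LHS = FirstNodeSpecific,
  BinaryExpr_RHS,
};

static_assert(sizeof(NodeRole) == 1, "roles are meant to fit in one byte");

// Generic roles can be handled by node-independent logic.
constexpr bool isGenericRole(NodeRole R) {
  return static_cast<std::uint8_t>(R) <
         static_cast<std::uint8_t>(NodeRole::FirstNodeSpecific);
}

// Position of a node-specific role within the upper block.
inline unsigned nodeSpecificIndex(NodeRole R) {
  assert(!isGenericRole(R) && "only valid for node-specific roles");
  return static_cast<unsigned>(R) -
         static_cast<unsigned>(NodeRole::FirstNodeSpecific);
}
```

Generic code can branch on `isGenericRole` alone. Whether node-specific roles may be shared between node kinds is left open here, as it is in the thread.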

Closed by commit rGb736969eddce: [Syntax] Introduce syntax trees (authored by ilya-biryukov). Jul 8 2019, 10:27 AM

This revision was automatically updated to reflect the committed changes.

@ilya-biryukov We're seeing buildbot failures in SyntaxTests.exe :
http://lab.llvm.org:8011/builders/llvm-clang-lld-x86_64-scei-ps4-ubuntu-fast/builds/50927
http://lab.llvm.org:8011/builders/llvm-clang-lld-x86_64-scei-ps4-windows10pro-fast/builds/26822

Failing Tests (1):

Clang-Unit :: Tooling/Syntax/./SyntaxTests.exe/SyntaxTreeTest.Basic

In D61637#1575542, @RKSimon wrote:
@ilya-biryukov We're seeing buildbot failures in SyntaxTests.exe :
http://lab.llvm.org:8011/builders/llvm-clang-lld-x86_64-scei-ps4-ubuntu-fast/builds/50927
http://lab.llvm.org:8011/builders/llvm-clang-lld-x86_64-scei-ps4-windows10pro-fast/builds/26822

Failing Tests (1):
Clang-Unit :: Tooling/Syntax/./SyntaxTests.exe/SyntaxTreeTest.Basic

Sorry about that. That's the same error we had with the previous patch. Will send a fix right away.

@ilya-biryukov I'm sorry but I've reverted this at rL365465

This revision is now accepted and ready to land.Jul 9 2019, 4:30 AM

RKSimon requested changes to this revision.Jul 9 2019, 4:30 AM

This revision now requires changes to proceed.Jul 9 2019, 4:30 AM

Relanded in rL365466 with a fix to the crash.

ilya-biryukov abandoned this revision.Jul 9 2019, 4:35 AM

sammccall added inline comments.Jul 9 2019, 6:43 AM

clang/include/clang/Tooling/Syntax/Nodes.h
43 ↗	(On Diff #206715)	I think we misunderstood each other here... I think this is fairly important, and that we'd agreed on it in offline discussion. Didn't mean to leave it as an optional comment. I'd be very surprised if 256 were too limiting. Indeed most nodes will have children, but most of them will not have unique roles. (And I would be surprised if we have 200 node types, but maybe not that surprised...). If there's a more fundamental objection to merging these, I'd like to find some agreement before going further.

Revision Contents

Path	Size
clang/include/clang/Tooling/Syntax/Corpus.h	61 lines
clang/include/clang/Tooling/Syntax/Tree.h	45 lines
clang/include/clang/Tooling/Syntax/Tree/ (4 files, names not shown)	105, 26, 111 and 70 lines
clang/lib/Tooling/Syntax/ (6 files, names not shown)	221, 7, 100, 30, 36 and 30 lines
clang/tools/CMakeLists.txt	1 line
clang/unittests/Tooling/Syntax/CMakeLists.txt	5 lines
clang/unittests/Tooling/Syntax/TreeTest.cpp	155 lines

Diff 198427

clang/include/clang/Tooling/Syntax/Corpus.h

This file was added.

				//===- Corpus.h - memory arena and bookkeeping for syntax trees --- C++ -*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				#ifndef LLVM_CLANG_TOOLING_SYNTAX_CORPUS_H
				#define LLVM_CLANG_TOOLING_SYNTAX_CORPUS_H

				#include "clang/Basic/LangOptions.h"
				#include "clang/Basic/SourceLocation.h"
				#include "clang/Basic/SourceManager.h"
				#include "clang/Tooling/Syntax/Tokens.h"
				#include "llvm/ADT/ArrayRef.h"
				#include "llvm/ADT/DenseMap.h"
				#include "llvm/Support/Allocator.h"

				namespace clang {
				namespace syntax {
				/// A memory arena for syntax trees. In addition, it also tracks the underlying
				/// token buffers, source manager, etc.
				class Corpus {
				public:
				Corpus(SourceManager &SourceMgr, const LangOptions &LangOpts,
				TokenBuffer MainFile);

				const SourceManager &sourceManager() const { return SourceMgr; }
				const LangOptions &langOptions() const { return LangOpts; }

				const TokenBuffer &tokenBuffer() const;
				llvm::BumpPtrAllocator &allocator() { return Allocator; }

				/// Add \p Buffer to the underlying source manager, tokenize it and store the
				/// resulting tokens. Useful when there is a need to materialize tokens that
				/// were not written in user code.
				std::pair<FileID, llvm::ArrayRef<syntax::Token>>
				tokenizeBuffer(std::unique_ptr<llvm::MemoryBuffer> Buffer);

				/// Construct a new syntax node of a specified kind. The memory for a node is
				/// owned by the corpus and will be freed when the corpus is destroyed.
				template <class TNode, class... Args> TNode *construct(Args &&... As) {
				static_assert(std::is_trivially_destructible<TNode>::value,
				"nodes should be trivially destructible");
				return new (Allocator) TNode(std::forward<Args>(As)...);
				}

				private:
				SourceManager &SourceMgr;
				const LangOptions &LangOpts;
				TokenBuffer Tokens;
				/// IDs and storage for additional tokenized files.
				llvm::DenseMap<FileID, std::vector<syntax::Token>> ExtraTokens;
				/// Keeps all the allocated nodes and their intermediate data structures.
				llvm::BumpPtrAllocator Allocator;
				};

				} // namespace syntax
				} // namespace clang

				#endif
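
The placement-new style discussed in the thread above can be sketched outside of LLVM as follows. This is a minimal illustration, not the code from the patch: `ToyArena`, `Leaf` and `makeLeaf` are hypothetical names, and `ToyArena` is only a simplified stand-in for `llvm::BumpPtrAllocator`.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for llvm::BumpPtrAllocator: memory is carved out of
// slabs and released all at once when the arena is destroyed.
class ToyArena {
public:
  void *allocate(std::size_t Size, std::size_t Align) {
    assert(Align != 0 && "alignment must be non-zero");
    std::size_t Offset = (Used + Align - 1) / Align * Align;
    if (Slabs.empty() || Offset + Size > Capacity) {
      // Start a new slab; an oversized request gets a slab of its own size.
      Capacity = DefaultSlabSize;
      if (Size > Capacity)
        Capacity = Size;
      Slabs.emplace_back(new unsigned char[Capacity]);
      Offset = 0;
    }
    Used = Offset + Size;
    return Slabs.back().get() + Offset;
  }

private:
  static constexpr std::size_t DefaultSlabSize = 4096;
  std::vector<std::unique_ptr<unsigned char[]>> Slabs;
  std::size_t Capacity = 0; // size of the current slab
  std::size_t Used = 0;     // bytes used in the current slab
};

// Placement form that makes `new (Arena) T(...)` allocate from the arena.
inline void *operator new(std::size_t Size, ToyArena &A) {
  return A.allocate(Size, alignof(std::max_align_t));
}
// Matching placement delete; only called if a constructor throws.
inline void operator delete(void *, ToyArena &) noexcept {}

// Hypothetical node type. Destructors never run for arena-allocated nodes,
// so the type is required to be trivially destructible.
struct Leaf {
  explicit Leaf(int Tok) : Tok(Tok) {}
  int Tok;
};
static_assert(std::is_trivially_destructible<Leaf>::value,
              "nodes should be trivially destructible");

// The arena owns the memory; the caller must not delete the result.
inline Leaf *makeLeaf(ToyArena &A, int Tok) { return new (A) Leaf(Tok); }
```

The call site `new (A) Leaf(Tok)` makes it visible that the node is owned by the arena rather than by the caller, which is the readability argument made in the review.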

clang/include/clang/Tooling/Syntax/Tree.h

This file was added.

				//===- Tree.h - syntax trees ----------------------------------- C++ --=====//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				// Syntax tree is a parse tree for the C++ source code designed to allow
				// mutations.
				//
				// The code is divided in the following way:
				// - 'Tree/Cascade.h' defines the basic structure of the syntax tree,
				sammccall (Done): As discussed offline:
				- I don't think (at this point) we need an umbrella header. Generic tree structure, specific node semantics, and operations like "build a tree" are distinct enough from a user POV that asking them to include headers is fine.
				- We may want an umbrella for the node types, if we end up splitting that, but no need yet.
				- Splitting generic tree stuff vs specific node stuff sounds good, but I think having them be sibling headers in `Tooling/Syntax` is enough - not sure about the `Tree/` subdirectory.
				So I'd suggest something like:
				- `Tree/Cascade.h` + `Arena.h` --> `Tree.h`
				- `Tree.h` --> `BuildTree.h`
				- `Tree/Nodes.h` + `NodeKind.h` --> `Nodes.h`
				(have comments on some of these throughout)
				ilya-biryukov (Done): Changed to the proposed header structure, it looks good. Had to move `Leaf::classof` and `TreeNode::classof` into a `.cpp`, they need a definition of `NodeKind`. Keeping those in a header file was a micro-optimization anyway, that's probably not too important.
				// - 'Tree/Nodes.h' defines the concrete node classes, corresponding to the
				// - 'Tree.h' (this file) defines basic operations for constructing and
				// traversing a syntax tree.
				//
				// This is still work in progress and highly experimental, we leave room for
				// ourselves to completely change the design and/or implementation.
				//===----------------------------------------------------------------------===//
				#ifndef LLVM_CLANG_TOOLING_SYNTAX_TREE_H
				#define LLVM_CLANG_TOOLING_SYNTAX_TREE_H

				#include "clang/AST/Decl.h"
				#include "clang/Tooling/Core/Replacement.h"
				#include "clang/Tooling/Syntax/Corpus.h"
				#include "clang/Tooling/Syntax/Tree/Cascade.h"
				#include "clang/Tooling/Syntax/Tree/Nodes.h"
				#include "llvm/ADT/ArrayRef.h"
				#include "llvm/ADT/STLExtras.h"

				namespace clang {
				namespace syntax {

				/// Build a syntax tree for the main file.
				sammccall (Not Done): for a translation unit? or for only decls within the main file?
				ilya-biryukov (Done): For a translation unit. We will add versions that build for a subtree of the AST later.
				TranslationUnit *buildSyntaxTree(Corpus &C,
				const clang::TranslationUnitDecl &TU);

				/// Perform a post-order traversal of the syntax tree, calling \p Visit at each
				/// node.
				void traverse(Node *N, llvm::function_ref<void(Node *)> Visit);
				sammccall (Not Done): I've been burned with adding these APIs without use cases. It seems likely you want a way to:
				- skip traversal of children
				- abort the traversal entirely
				ilya-biryukov (Not Done): Not having an option to abort traversal protects us against timing attacks... Agree with both, will address in this patch.
				ilya-biryukov (Done): Removed from the public API, we seem to have different ideas on what it should look like and I'd prefer to focus on the storage model in this patch.
				void traverse(const Node *N, llvm::function_ref<void(const Node *)> Visit);

				} // namespace syntax
				} // namespace clang
				#endif
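
The two capabilities requested in the review of `traverse` (skipping a node's children and aborting the whole traversal) could look roughly like the sketch below. It is not the API from the patch, which dropped `traverse` from the public interface; `ToyNode`, `VisitResult` and `traversePreOrder` are hypothetical names. The walk is pre-order, unlike the post-order `traverse` above, because skipping children only makes sense when the parent is visited first.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical minimal node type, standing in for syntax::Node.
struct ToyNode {
  int Id;
  std::vector<ToyNode *> Children;
};

// What the visitor wants the traversal to do next.
enum class VisitResult { Continue, SkipChildren, Abort };

// Pre-order walk over the subtree rooted at N.
// Returns false if the visitor aborted the traversal, true otherwise.
inline bool
traversePreOrder(ToyNode *N,
                 const std::function<VisitResult(ToyNode *)> &Visit) {
  assert(N && "cannot traverse a null node");
  VisitResult R = Visit(N);
  if (R == VisitResult::Abort)
    return false;
  if (R == VisitResult::SkipChildren)
    return true;
  for (ToyNode *Child : N->Children)
    if (!traversePreOrder(Child, Visit))
      return false; // propagate the abort up to the caller
  return true;
}
```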
				sammccall (Done): as discussed offline, having a leaf node store a range of tokens (rather than just one) looks attractive now, but as the syntax tree gets more detailed there are relatively few cases where multiple consecutive tokens should really be undifferentiated siblings. Might be better to bite the bullet now and make leaf hold a single token, so our consecutive token ranges become a linked list. This will also flush out accidental assumptions that only tokens in the same Leaf are adjacent. Given this, I'm not sure there's a better name than `syntax::Leaf`. You might consider making it a pointer-like object dereferenceable to Token&.
				ilya-biryukov (Done): Leaf now stores a single token, that actually simplifies things quite a bit, thanks! I'd avoid making it a pointer-like object, given that nodes are often passed as pointers on their own. Making them a pointer-like object would mean we can get code that does double dereferences (`Tok = **Leaf`).
				sammccall (Done): As discussed, I think `syntax::Tree` might actually be a better name here. "A Node is either a Tree or a Leaf" is a bit weird, but not too hard to remember I think.
				sammccall (Not Done): (curious: why prepend rather than append?)
				ilya-biryukov (Done): Appending to a linked list is `O(n)`. If we reverse it, traversing left-to-right order is `O(n)`.
				ilya-biryukov (Not Done): Append is `O(n)` in the current representation as it requires walking to the tail of the list.
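
The cost difference mentioned in this thread can be illustrated with a small sketch of a singly linked child list. The names `ToyNode`, `prependChild`, `appendChild` and `childIds` are hypothetical; the field names `FirstChild` and `NextSibling` follow the linked-list representation suggested elsewhere in this review, and the code is an illustration rather than the patch's implementation.

```cpp
#include <cassert>
#include <vector>

// Hypothetical node with an intrusive, singly linked list of children.
struct ToyNode {
  explicit ToyNode(int Id) : Id(Id) {}
  int Id;
  ToyNode *NextSibling = nullptr;
  ToyNode *FirstChild = nullptr;
};

// O(1): the new child becomes the head of the list.
inline void prependChild(ToyNode &Parent, ToyNode &Child) {
  assert(!Child.NextSibling && "child is already linked into a list");
  Child.NextSibling = Parent.FirstChild;
  Parent.FirstChild = &Child;
}

// O(n): without a tail pointer we have to walk to the end of the list.
inline void appendChild(ToyNode &Parent, ToyNode &Child) {
  assert(!Child.NextSibling && "child is already linked into a list");
  ToyNode **Slot = &Parent.FirstChild;
  while (*Slot)
    Slot = &(*Slot)->NextSibling;
  *Slot = &Child;
}

// Collects child ids in list order (left to right).
inline std::vector<int> childIds(const ToyNode &Parent) {
  std::vector<int> Ids;
  for (const ToyNode *C = Parent.FirstChild; C; C = C->NextSibling)
    Ids.push_back(C->Id);
  return Ids;
}
```

If the builder adds children from right to left, prepending each one yields a list in left-to-right order while keeping every insertion constant-time.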

clang/include/clang/Tooling/Syntax/Tree/Cascade.h

This file was added.

				//===- Tree.h - cascade of the syntax tree --------------------- C++ --=====//
				sammccall (Done): this is Cascade.h, not tree.h
				sammccall (Done): why "cascade"?
				sammccall (Done): The Tree/ subdirectory seems superfluous - why are these separate from Syntax/?
				ilya-biryukov (Done): Cascade defines a few base nodes: a composite node (`TreeNode`) and a leaf node that holds tokens. I'd really like to isolate them from language-specific nodes, so language-specific nodes live in a separate file (`Nodes.h`). However, they need to see the definition of a composite node, hence the split. Users are advised to use an umbrella header, `Tree.h`. The extra directory is to minimize the number of headers in the top-level directory, having too many is confusing.
				sammccall (Done): I like the separation - my concern here was the specific word "Cascade". I'd suggest "Tree" here as it really does define the structure. The existing "Tree.h" defines operations, and can be named after them. As discussed, I think the design is clean and doesn't need to be hidden by an umbrella header.
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				// Defines the basic structure of the syntax tree. There are two kinds of nodes
				// in a tree:
				// - leaf nodes correspond to a continuous subrange of tokens in expanded
				// token stream.
				// - composite (or tree) nodes correspond to language grammar constructs.
				//
				// The tree is initially built from an AST. Each node of a newly built tree
				// covers a continuous subrange of expanded tokens (i.e. tokens after
				// preprocessing), the specific tokens covered are stored in the leaf nodes of
				// a tree. A post-order traversal of a tree will visit leaf nodes in an order
				// corresponding to the original order of expanded tokens.
				//===----------------------------------------------------------------------===//
				#ifndef LLVM_CLANG_TOOLING_SYNTAX_TREE_CASCADE_H
				#define LLVM_CLANG_TOOLING_SYNTAX_TREE_CASCADE_H

				#include "clang/Basic/TokenKinds.h"
				#include "clang/Tooling/Syntax/Corpus.h"
				#include "clang/Tooling/Syntax/Tree/NodeKind.h"
				#include "clang/Tooling/Syntax/Tree/NodeList.h"
				#include "llvm/ADT/ArrayRef.h"

				namespace clang {
				namespace syntax {

				class TreeNode;
				class TreeBuilder;
				class Corpus;

				/// A node in a syntax tree. Knows only about its parent and kind.
				class Node {
				public:
				Node(NodeKind Kind) : Parent(nullptr), Kind(Kind) {}
				NodeKind kind() const { return Kind; }
				sammccall (Done): maybe add a comment - newly created nodes have no parent until added to one

				const TreeNode *parent() const { return Parent; }
				TreeNode *parent() { return Parent; }

				/// Dumps the structure of a subtree. For debugging and testing purposes.
				std::string dump(const Corpus &C) const;
				/// Dumps the tokens forming this subtree.
				std::string dumpTokens(const Corpus &C) const;

				private:
				// TreeNode is allowed to change the Parent link.
				friend class TreeNode;
				TreeNode *Parent;
				NodeKind Kind;
				};


				/// A leaf node points to a consecutive range of tokens in the expanded token
				/// stream.
				class Leaf final : public Node {
				sammccall (Not Done): `syntax::Leaf` vs `syntax::Struct` seems a little odd - it talks about the tree structure rather than the contents. (Unlike TreeNode/CompositeNode this is likely to be used for its specific semantics). Maybe `syntax::Tokens` (though the s is subtle). `syntax::Text`?
				ilya-biryukov (Not Done): `syntax::Tokens` actually looks good, but we should rename the `tokens()` accessor somehow in that case. I have only super-generic variants in my head: `elements()`, `items()`. Any better ideas?
				public:
				Leaf(llvm::ArrayRef<syntax::Token> Tokens)
				: Node(NodeKind::Leaf), Tokens(Tokens) {}

				static bool classof(const Node *N) { return N->kind() == NodeKind::Leaf; }
				sammccall (Not Done): you're going to want classof for each node type, a private copy constructor (for cloning), a friend statement to whatever does the cloning (or the clone() function itself, if it goes on the class...) You may want to put this behind a DEFINE_NODE_BOILERPLATE(Leaf) macro :-(
				ilya-biryukov (Not Done): I'd avoid using macros. As long as the amount of boilerplate is small, it's not too annoying to write. And it is small for now, we only have a single constructor and a `classof` per node, so keeping it explicit in this patch seems ok. We can revisit if more stuff pops up, but I think we can do it without extra boilerplate per node for clone, and hopefully for other things too.

				llvm::ArrayRef<syntax::Token> tokens() const { return Tokens; }

				private:
				llvm::ArrayRef<Token> Tokens;
				};

				/// A composite tree node that has children.
				class TreeNode : public Node {
				sammccall (Not Done): This use of "tree node" to mean specifically internal node seems confusing - is it common?
				ilya-biryukov (Done): I don't think it's common, can use `CompositeNode` - seems like a better alternative
				sammccall (Not Done): What about Subtree?
				ilya-biryukov (Not Done): `Subtree` seems ok, although `Composite` conveys the meaning better to my taste. `Composite` does not seem to work without the `node` suffix, though, and we probably don't want the suffix in other nodes, so I'm torn on this.
				public:
				using Node::Node;
				static bool classof(const Node *N) { return N->kind() > NodeKind::Leaf; }

				llvm::ArrayRef<Node *> children() {
				return llvm::makeArrayRef(Children.begin(), Children.end());
				}
				llvm::ArrayRef<const Node *> children() const {
				return llvm::makeArrayRef(Children.begin(), Children.end());
				}

				protected:
				/// Find a leaf with a single token of a corresponding kind.
				/// EXPECTS: the leaf node of a corresponding kind is found.
				/// ENSURES: the result is non-null.
				syntax::Leaf* findLeaf(tok::TokenKind K);

				private:
				/// Appends \p Child to the list of children and sets the parent pointer.
				/// A very low-level operation that does not check any invariants, only used
				/// by TreeBuilder.
				void appendChildLowLevel(Corpus &C, Node *Child);
				friend class TreeBuilder;

				NodeList Children;
				};

				} // namespace syntax
				} // namespace clang

				#endif
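
The `classof` members in this header follow the LLVM-style RTTI pattern: each class reports whether a node's kind belongs to it, and casts are built on top of that check. The sketch below shows the idea without LLVM. The class and enumerator names mirror the patch, but `isaNode`, `dynCastNode` and `castNode` are hypothetical stand-ins for `llvm::isa`, `llvm::dyn_cast` and `llvm::cast`, and all other members are omitted.

```cpp
#include <cassert>

// Leaf comes first so that "every composite kind" is the range after it.
enum class NodeKind { Leaf, RecoveryNode, TranslationUnit };

class Node {
public:
  explicit Node(NodeKind Kind) : Kind(Kind) {}
  NodeKind kind() const { return Kind; }

private:
  NodeKind Kind;
};

class Leaf final : public Node {
public:
  Leaf() : Node(NodeKind::Leaf) {}
  static bool classof(const Node *N) { return N->kind() == NodeKind::Leaf; }
};

// Base class of all composite nodes; matches any kind that is not Leaf.
class TreeNode : public Node {
public:
  using Node::Node;
  static bool classof(const Node *N) { return N->kind() > NodeKind::Leaf; }
};

class TranslationUnit final : public TreeNode {
public:
  TranslationUnit() : TreeNode(NodeKind::TranslationUnit) {}
  static bool classof(const Node *N) {
    return N->kind() == NodeKind::TranslationUnit;
  }
};

// Hypothetical helpers modelled on llvm::isa / dyn_cast / cast.
template <class To> bool isaNode(const Node *N) { return To::classof(N); }

template <class To> To *dynCastNode(Node *N) {
  return isaNode<To>(N) ? static_cast<To *>(N) : nullptr;
}

template <class To> To *castNode(Node *N) {
  assert(isaNode<To>(N) && "castNode<To>() argument of incompatible kind");
  return static_cast<To *>(N);
}
```

The range check in `TreeNode::classof` relies on the enumerator order, so adding a leaf-like kind after `Leaf` would silently change what counts as a composite node.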

clang/include/clang/Tooling/Syntax/Tree/NodeKind.h

This file was added.

				//===- NodeKind.h - an enum listing nodes of a syntax tree ----- C++ --=====//
				//
				sammccall (Done): Why is this separate from Nodes.h?
				ilya-biryukov (Done): Not anymore. The original reason was that `Tree.h` needs `NodeKind` for implementing casts and the implementation was in a header.
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				#ifndef LLVM_CLANG_TOOLING_SYNTAX_TREE_NODEKIND_H
				#define LLVM_CLANG_TOOLING_SYNTAX_TREE_NODEKIND_H
				#include "llvm/ADT/StringRef.h"
				namespace clang {
				namespace syntax {
				/// A kind of a syntax node, used for implementing casts.
				enum class NodeKind {
				Leaf,
				RecoveryNode,
				TranslationUnit,
				TopLevelDecl,
				CompoundStatement,
				};
				/// For debugging purposes.
				llvm::StringRef toString(NodeKind K);

				sammccall (Done): if you make this `operator<<`, then it's slightly more flexible I think (llvm::to_string also works). It's not as fast, but I don't think that matters here?
				ilya-biryukov (Done): For some reason I thought `gdb` does not show the enum value names, so I added a named method to simplify debugging. Turns out it does show the enum names, not just int values, so I'm perfectly happy to use the stream output operator.
				} // namespace syntax
				} // namespace clang
				#endif
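
The stream-operator alternative to `toString(NodeKind)` suggested in the review could look like the sketch below. It uses `std::ostream` so that it compiles on its own; the real code would presumably target `llvm::raw_ostream`, and `kindToString` is a hypothetical helper added only to show that a string conversion can be derived from the operator.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

enum class NodeKind {
  Leaf,
  RecoveryNode,
  TranslationUnit,
  TopLevelDecl,
  CompoundStatement,
};

// For debugging purposes: prints the enumerator name.
inline std::ostream &operator<<(std::ostream &OS, NodeKind K) {
  switch (K) {
  case NodeKind::Leaf:
    return OS << "Leaf";
  case NodeKind::RecoveryNode:
    return OS << "RecoveryNode";
  case NodeKind::TranslationUnit:
    return OS << "TranslationUnit";
  case NodeKind::TopLevelDecl:
    return OS << "TopLevelDecl";
  case NodeKind::CompoundStatement:
    return OS << "CompoundStatement";
  }
  assert(false && "unknown NodeKind");
  return OS << "<unknown>";
}

// Hypothetical helper: any type with operator<< can be turned into a string.
inline std::string kindToString(NodeKind K) {
  std::ostringstream OS;
  OS << K;
  return OS.str();
}
```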

clang/include/clang/Tooling/Syntax/Tree/NodeList.h

This file was added.

				//===- NodeList.h ---------------------------------------------- C++ --=====//
				//
				sammccall (Done): Implementing custom containers is a bit sad. Alternatives we discussed:
				- adapt bumpptrallocator to std allocator and use std::vector: wastes a pointer from vector->allocator
				- use `ArrayRef<Node*>` or `ArrayRef<Node>` and require the whole list to be reallocated when children change (but *replacing* children is fine)
				- use a linked-list representation instead: `Node* Node::NextSibling`, `CompositeNode::FirstChild`. This fits the allocation strategy more nicely. You probably need only single links, and can add an iterator if needed.
				ilya-biryukov (Done): No more custom containers, explicit tree structure seems to be better.
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				// A helper for storing a mutable list of nodes allocated in an arena.
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_CLANG_TOOLING_SYNTAX_TREE_NODELIST_H
				#define LLVM_CLANG_TOOLING_SYNTAX_TREE_NODELIST_H

				#include "llvm/Support/Allocator.h"
				#include <algorithm>
				#include <cstddef>
				#include <type_traits>

				namespace clang {
				namespace syntax {
				namespace detail {
				template <class T> class BumpAllocVector;
				}
				class Node;
				/// Like vector<Node*>, but allocates all the memory in the BumpPtrAllocator.
				/// Can be dropped without running the destructor.
				using NodeList = detail::BumpAllocVector<Node *>;

				namespace detail {
				/// A vector that requests memory from BumpPtrAllocator and has a trivial
				/// destructor. Syntax tree nodes use it to store children.
				/// FIXME: make implementation more sanitizer-friendly to allow catching
				/// out-of-bounds accesses.
				template <class T> class BumpAllocVector {
				// Make sure the elements do not require their destructors to run.
				static_assert(std::is_trivially_destructible<T>::value,
				"T must be trivially destructible");
				// Assumption to simplify the implementation. We only store pointers, so this
				// should hold.
				static_assert(std::is_trivially_copyable<T>::value,
				"T must be trivially copyable.");

				public:
				T *begin() { return Begin; }
				T *end() { return End; }

				const T *begin() const { return Begin; }
				const T *end() const { return End; }

				bool empty() const { return begin() == end(); }
				size_t size() const { return End - Begin; }
				size_t capacity() const { return StorageEnd - Begin; }

				void push_back(llvm::BumpPtrAllocator &A, T Element) {
				if (StorageEnd == End)
				Grow(A);
				*End = Element;
				++End;
				}

				T *erase(T *Begin, T *End) {
				std::ptrdiff_t ErasedSize = End - Begin;
				for (auto *It = End; It != this->End; ++It) {
				*Begin = *It;
				++Begin;
				}
				this->End -= ErasedSize;
				return Begin;
				}

				T *insert(llvm::BumpPtrAllocator &A, T *Pos, T *Begin, T *End) {
				unsigned ToAdd = End - Begin;
				if (capacity() - size() < ToAdd)
				Grow(A, size() + ToAdd);
				std::rotate(Pos, Pos + ToAdd, End + ToAdd);
				std::copy(Begin, End, Pos);
				this->End += ToAdd;
				return Pos;
				}

				private:
				void Grow(llvm::BumpPtrAllocator &A, unsigned MinSize = 0) {
				size_t Size = End - Begin;

				size_t NewCapacity = 2 * (StorageEnd - Begin);
				if (NewCapacity == 0)
				NewCapacity = 1;
				while (NewCapacity < MinSize)
				NewCapacity = 2 * NewCapacity;

				T *NewStorage = A.Allocate<T>(NewCapacity);
				std::copy(Begin, End, NewStorage);

				A.Deallocate(Begin, StorageEnd - Begin);

				Begin = NewStorage;
				End = NewStorage + Size;
				StorageEnd = NewStorage + NewCapacity;
				}

				private:
				T *Begin = nullptr;
				T *End = nullptr;
				T *StorageEnd = nullptr;
				};
				} // namespace detail

				} // namespace syntax
				} // namespace clang

				#endif

clang/include/clang/Tooling/Syntax/Tree/Nodes.h

This file was added.

				//===- Nodes.h - syntax nodes for C/C++ grammar constructs ----- C++ --=====//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				// Syntax tree nodes for C, C++ and Objective-C grammar constructs.
				//===----------------------------------------------------------------------===//
				#ifndef LLVM_CLANG_TOOLING_SYNTAX_TREE_NODES_H
				#define LLVM_CLANG_TOOLING_SYNTAX_TREE_NODES_H
				#include "clang/Basic/TokenKinds.h"
				#include "clang/Lex/Token.h"
				#include "clang/Tooling/Syntax/Tokens.h"
				#include "clang/Tooling/Syntax/Tree/Cascade.h"
				#include "clang/Tooling/Syntax/Tree/NodeList.h"
				#include "llvm/ADT/ArrayRef.h"
				#include "llvm/ADT/StringRef.h"
				namespace clang {
				namespace syntax {

				/// A root node for a translation unit. Parent is always null.
				class TranslationUnit final : public TreeNode {
				public:
				TranslationUnit() : TreeNode(NodeKind::TranslationUnit) {}
				static bool classof(const Node *N) {
				return N->kind() == NodeKind::TranslationUnit;
				}
				};

				/// A tree node of an unknown kind, i.e. a syntax error or a construct from the
				/// clang AST without a syntax counterpart.
				/// These nodes can appear at any place in the syntax tree.
				class RecoveryNode final : public TreeNode {
				public:
				RecoveryNode() : TreeNode(NodeKind::RecoveryNode) {}
				static bool classof(const Node *N) {
				return N->kind() == NodeKind::RecoveryNode;
				}
				};

				/// FIXME: this node is temporary and will be replaced with nodes for various
				/// 'declarations' and 'declarators' from the C/C++ grammar
				///
				/// Represents any top-level declaration. Only there to give the syntax tree a
				/// bit of structure until we implement syntax nodes for declarations and
				/// declarators.
				class TopLevelDecl final : public TreeNode {
				public:
				TopLevelDecl() : TreeNode(NodeKind::TopLevelDecl) {}
				static bool classof(const Node *N) {
				return N->kind() == NodeKind::TopLevelDecl;
				}
				};

				/// Represents a compound statement.
				class CompoundStatement final : public TreeNode {
				public:
				CompoundStatement() : TreeNode(NodeKind::CompoundStatement) {}
				static bool classof(const Node *N) {
				return N->kind() == NodeKind::CompoundStatement;
				}

				Leaf *lbrace();
				Leaf *rbrace();
				};

				} // namespace syntax
				} // namespace clang
				#endif

clang/lib/Tooling/Syntax/BuildFromAST.cpp

This file was added.

				//===- BuildFromAST.cpp ---------------------------------------- C++ --=====//
				//
				sammccall (Not Done): I haven't reviewed this file yet :-)
				ilya-biryukov (Not Done): Please do!
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				#include "clang/AST/RecursiveASTVisitor.h"
				#include "clang/Basic/SourceLocation.h"
				#include "clang/Basic/TokenKinds.h"
				#include "clang/Lex/Lexer.h"
				#include "clang/Tooling/Syntax/Corpus.h"
				#include "clang/Tooling/Syntax/Tokens.h"
				#include "clang/Tooling/Syntax/Tree.h"
				#include "clang/Tooling/Syntax/Tree/Cascade.h"
				#include "clang/Tooling/Syntax/Tree/Nodes.h"
				#include "llvm/ADT/ArrayRef.h"
				#include "llvm/ADT/STLExtras.h"
				#include "llvm/Support/Casting.h"

				using namespace clang;

				/// A helper class for constructing the syntax tree while traversing a clang
				/// AST. The tree is built left-to-right and bottom-up. At each point of the
				/// traversal we maintain a list of currently processed nodes.
				class syntax::TreeBuilder {
				public:
				TreeBuilder(syntax::Corpus &Corpus) : Corpus(Corpus) {}

				syntax::Corpus &corpus() { return Corpus; }

				/// Populate children for \p New, assuming it covers the token range between
				/// the tokens located at \p First and \p Last (inclusive!).
				/// All currently processed nodes which fall into the range between \p First
				/// and \p Last are added as children of the new node.
				void learnNode(SourceLocation First, SourceLocation Last,
				syntax::TreeNode *New);

				/// Add a leaf node for a token starting at \p Loc.
				void learnTokenNode(SourceLocation Loc, tok::TokenKind Kind);

				/// Finish building the tree and create a root TranslationUnit node. No
				/// further calls to learn* methods are allowed after this call.
				void learnRoot();

				/// Consume the root node.
				syntax::TranslationUnit *root() && {
				assert(Root);
				assert(NodesInProgress.empty());
				return Root;
				}

				private:
				struct RangedNode {
				RangedNode(llvm::ArrayRef<syntax::Token> Tokens, syntax::Node *Node)
				: Tokens(Tokens), Node(Node) {}

				llvm::ArrayRef<syntax::Token> Tokens;
				syntax::Node *Node;
				};
				const syntax::Token *findToken(SourceLocation TokLoc) const;

				void learnNodeImpl(const syntax::Token *Begin, const syntax::Token *End,
				syntax::TreeNode *New);

				syntax::Corpus &Corpus;
				std::vector<RangedNode> NodesInProgress;
				syntax::TranslationUnit *Root = nullptr;
				};

				namespace {
				class BuildTreeVisitor : public RecursiveASTVisitor<BuildTreeVisitor> {
				public:
				explicit BuildTreeVisitor(ASTContext &Ctx, syntax::TreeBuilder &Builder)
				: Builder(Builder), LangOpts(Ctx.getLangOpts()) {}

				bool shouldTraversePostOrder() const { return true; }

				bool TraverseDecl(Decl *D) {
				if (!D || isa<TranslationUnitDecl>(D))
				return RecursiveASTVisitor::TraverseDecl(D);
				if (!llvm::isa<TranslationUnitDecl>(D->getDeclContext()))
				return true; // Only build top-level decls for now, do not recurse.
				return RecursiveASTVisitor::TraverseDecl(D);
				}

				bool TraverseCompoundStmt(CompoundStmt *S) {
				Builder.learnTokenNode(S->getLBracLoc(), tok::l_brace);
				Builder.learnTokenNode(S->getRBracLoc(), tok::r_brace);

				Builder.learnNode(S->getBeginLoc(), S->getEndLoc(),
				corpus().construct<syntax::CompoundStatement>());
				// (!) we do not recurse into compound statements for now.
				return true;
				}

				bool VisitDecl(Decl *D) {
				assert(llvm::isa<TranslationUnitDecl>(D->getDeclContext()) &&
				"expected a top-level decl");
				assert(!D->isImplicit());
				Builder.learnNode(D->getBeginLoc(), D->getEndLoc(),
				corpus().construct<syntax::TopLevelDecl>());
				return true;
				}

				bool WalkUpFromTranslationUnitDecl(TranslationUnitDecl *TU) {
				Builder.learnRoot();
				// (!) we do not want to call VisitDecl() at this point.
				return true;
				}

				private:
				/// A small helper to save some typing.
				syntax::Corpus &corpus() { return Builder.corpus(); }

				syntax::TreeBuilder &Builder;
				const LangOptions &LangOpts;
				};
				} // namespace

				void syntax::TreeBuilder::learnNode(SourceLocation First, SourceLocation Last,
				syntax::TreeNode *New) {
				assert(First.isValid());
				assert(Last.isValid());
				assert(First == Last ||
				Corpus.sourceManager().isBeforeInTranslationUnit(First, Last));

				learnNodeImpl(findToken(First), std::next(findToken(Last)), New);
				}

				void syntax::TreeBuilder::learnNodeImpl(const syntax::Token *Begin,
				const syntax::Token *End,
				syntax::TreeNode *New) {
				auto FirstChild =
				std::find_if(NodesInProgress.rbegin(), NodesInProgress.rend(),
				[&](const RangedNode &L) {
				if (&L.Tokens.front() < Begin) {
				assert(&L.Tokens.back() <= End);
				return true;
				}
				return false;
				})
				.base();

				syntax::Corpus &C = corpus();

				auto *NextTok = Begin;
				auto CoverUnknownTokens = [&](const syntax::Token *UpTo) {
				if (NextTok == UpTo)
				return;
				assert(NextTok < UpTo);
				auto *Recovery = C.construct<RecoveryNode>();
				Recovery->appendChildLowLevel(
				C, C.construct<syntax::Leaf>(llvm::makeArrayRef(NextTok, UpTo)));

				New->appendChildLowLevel(C, Recovery);
				NextTok = UpTo;
				};
				for (auto It = FirstChild; It != NodesInProgress.end(); ++It) {
				// Add non-covered ranges as recovery nodes.
				CoverUnknownTokens(&It->Tokens.front());

				New->appendChildLowLevel(corpus(), It->Node);
				NextTok = It->Tokens.end();
				}
				CoverUnknownTokens(End);

				NodesInProgress.erase(FirstChild, NodesInProgress.end());
				NodesInProgress.push_back(RangedNode{llvm::makeArrayRef(Begin, End), New});
				}

				void syntax::TreeBuilder::learnTokenNode(SourceLocation Loc,
				tok::TokenKind Kind) {
				auto *T = findToken(Loc);
				assert(T->kind() == Kind);

				assert(NodesInProgress.empty() ||
				NodesInProgress.back().Tokens.end() <= T &&
				"only allowed to add a node to the end");

				auto Tokens = llvm::makeArrayRef(T, 1);
				syntax::Leaf *New = corpus().construct<syntax::Leaf>(Tokens);
				NodesInProgress.push_back(RangedNode{Tokens, New});
				}

				void syntax::TreeBuilder::learnRoot() {
				auto Tokens = Corpus.tokenBuffer().expandedTokens();
				// Add 'eof' as a separate token node, it should not go into recovery node.
				learnTokenNode(Tokens.back().location(), tok::eof);

				learnNode(Tokens.front().location(), Tokens.back().location(),
				corpus().construct<syntax::TranslationUnit>());

				assert(NodesInProgress.size() == 1);
				assert(NodesInProgress.front().Node->kind() ==
				syntax::NodeKind::TranslationUnit);

				Root = cast<syntax::TranslationUnit>(NodesInProgress.front().Node);
				NodesInProgress.clear();
				}

				const syntax::Token *
				syntax::TreeBuilder::findToken(SourceLocation TokLoc) const {
				auto Tokens = Corpus.tokenBuffer().expandedTokens();
				auto &SM = Corpus.sourceManager();
				auto It =
				std::lower_bound(Tokens.begin(), Tokens.end(), TokLoc,
				[&](const syntax::Token &L, SourceLocation R) {
				return SM.isBeforeInTranslationUnit(L.location(), R);
				});
				assert(It != Tokens.end());
				assert(SM.getFileOffset(It->location()) == SM.getFileOffset(TokLoc));
				return &*It;
				}
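
findToken relies on the expanded tokens being sorted by location, so a binary search is enough. The sketch below shows the same lookup on simplified data. `Tok` and `findTokenAt` are hypothetical names, plain offsets stand in for SourceLocation, and the function returns nullptr on a miss where the patch asserts.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical token: only the start offset matters for the lookup.
struct Tok {
  unsigned Offset;
  char Kind;
};

// Finds the token that starts exactly at Offset. Tokens must be sorted by
// Offset. Returns nullptr when no token starts at that offset.
const Tok *findTokenAt(const std::vector<Tok> &Tokens, unsigned Offset) {
  // lower_bound returns the first token whose offset is not less than Offset.
  auto It = std::lower_bound(
      Tokens.begin(), Tokens.end(), Offset,
      [](const Tok &L, unsigned R) { return L.Offset < R; });
  if (It == Tokens.end() || It->Offset != Offset)
    return nullptr;
  return &*It;
}
```

Returning nullptr keeps the sketch usable outside assert-enabled builds; the patch treats a miss as a programming error instead.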

				syntax::TranslationUnit *
				syntax::buildSyntaxTree(Corpus &C, const TranslationUnitDecl &TU) {
				TreeBuilder Builder(C);
				BuildTreeVisitor(TU.getASTContext(), Builder).TraverseAST(TU.getASTContext());
				return std::move(Builder).root();
				}

clang/lib/Tooling/Syntax/CMakeLists.txt

	set(LLVM_LINK_COMPONENTS Support)			set(LLVM_LINK_COMPONENTS Support)

	add_clang_library(clangToolingSyntax			add_clang_library(clangToolingSyntax
				BuildFromAST.cpp
				Cascade.cpp
				Corpus.cpp
				Nodes.cpp
	Tokens.cpp			Tokens.cpp
				Tree.cpp

	LINK_LIBS			LINK_LIBS
				clangAST
	clangBasic			clangBasic
	clangFrontend			clangFrontend
	clangLex			clangLex
				clangToolingCore
	)			)

clang/lib/Tooling/Syntax/Cascade.cpp

This file was added.

				//===- Cascade.cpp --------------------------------------------*- C++ -*-=====//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				#include "clang/Tooling/Syntax/Tree/Cascade.h"
				#include "clang/Basic/TokenKinds.h"
				#include "clang/Tooling/Syntax/Tree.h"
				#include "clang/Tooling/Syntax/Tree/NodeList.h"
				#include "llvm/ADT/ArrayRef.h"
				#include "llvm/ADT/STLExtras.h"
				#include "llvm/Support/Casting.h"

				using namespace clang;

				void syntax::TreeNode::appendChildLowLevel(Corpus &C, Node *Child) {
				assert(Child->Parent == nullptr);
				Child->Parent = this;
				Children.push_back(C.allocator(), Child);
				}

				namespace {
				static void dumpTokens(llvm::raw_ostream &OS, ArrayRef<syntax::Token> Tokens,
				const SourceManager &SM) {
				assert(!Tokens.empty());
				bool First = true;
				for (const auto &T : Tokens) {
				if (!First)
				OS << " ";
				else
				First = false;
				// Handle 'eof' separately, calling text() on it produces an empty string.
				if (T.kind() == tok::eof) {
				OS << "<eof>";
				continue;
				}
				OS << T.text(SM);
				}
				}

				static void dumpTree(llvm::raw_ostream &OS, const syntax::Node *N,
				const syntax::Corpus &C, std::vector<bool> IndentMask) {
				if (auto *L = llvm::dyn_cast<syntax::Leaf>(N)) {
				dumpTokens(OS, L->tokens(), C.sourceManager());
				OS << "\n";
				return;
				}

				auto *T = llvm::cast<syntax::TreeNode>(N);
				OS << toString(T->kind()) << "\n";

				for (auto It = T->children().begin(); It != T->children().end(); ++It) {
				for (bool Filled : IndentMask) {
				if (Filled)
				OS << "| ";
				else
				OS << " ";
				}
				if (std::next(It) == T->children().end()) {
				OS << "`-";
				IndentMask.push_back(false);
				} else {
				OS << "|-";
				IndentMask.push_back(true);
				}
				dumpTree(OS, *It, C, IndentMask);
				IndentMask.pop_back();
				}
				}
				} // namespace

				std::string syntax::Node::dump(const Corpus &C) const {
				std::string Str;
				llvm::raw_string_ostream OS(Str);
				dumpTree(OS, this, C, /*IndentMask=*/{});
				return std::move(OS.str());
				}

				std::string syntax::Node::dumpTokens(const Corpus &C) const {
				std::string Storage;
				llvm::raw_string_ostream OS(Storage);
				traverse(this, [&](const syntax::Node *N) {
				auto *L = llvm::dyn_cast<syntax::Leaf>(N);
				if (!L)
				return;
				::dumpTokens(OS, L->tokens(), C.sourceManager());
				});
				return OS.str();
				}

				syntax::Leaf *syntax::TreeNode::findLeaf(tok::TokenKind K) {
				auto It = llvm::find_if(Children, [K](syntax::Node *N) {
				auto *L = dyn_cast<Leaf>(N);
				return L && L->tokens().size() == 1 && L->tokens().front().kind() == K;
				});
				assert(It != Children.end());
				return cast<Leaf>(*It);
				}
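
The indent-mask technique used by dumpTree can be tried without any Clang dependencies. The sketch below is an illustration under assumptions, not the patch's implementation: `SketchNode`, `dumpSketchImpl` and `dumpSketch` are hypothetical names, and the sketch pads finished levels with two spaces, which may differ from the exact padding in the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal tree node: a label plus owned children.
struct SketchNode {
  std::string Label;
  std::vector<std::unique_ptr<SketchNode>> Children;
};

// IndentMask holds one flag per ancestor level: true means that level still
// has siblings to print, so a vertical bar is drawn in its column.
void dumpSketchImpl(std::ostream &OS, const SketchNode &N,
                    std::vector<bool> &IndentMask) {
  OS << N.Label << "\n";
  for (std::size_t I = 0; I < N.Children.size(); ++I) {
    for (bool Filled : IndentMask)
      OS << (Filled ? "| " : "  ");
    bool IsLast = I + 1 == N.Children.size();
    OS << (IsLast ? "`-" : "|-");
    IndentMask.push_back(!IsLast);
    dumpSketchImpl(OS, *N.Children[I], IndentMask);
    IndentMask.pop_back();
  }
}

std::string dumpSketch(const SketchNode &Root) {
  std::ostringstream OS;
  std::vector<bool> IndentMask;
  dumpSketchImpl(OS, Root, IndentMask);
  return OS.str();
}
```

The sketch passes the mask by reference and restores it with pop_back after each child, which avoids copying the vector on every recursive call; the patch passes it by value.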

clang/lib/Tooling/Syntax/Corpus.cpp

This file was added.

				//===- Corpus.cpp ---------------------------------------------*- C++ -*-=====//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				#include "clang/Tooling/Syntax/Corpus.h"
				#include "clang/Basic/LangOptions.h"
				#include "clang/Lex/Lexer.h"
				#include "llvm/ADT/ArrayRef.h"

				using namespace clang;

				syntax::Corpus::Corpus(SourceManager &SourceMgr, const LangOptions &LangOpts,
				TokenBuffer Tokens)
				: SourceMgr(SourceMgr), LangOpts(LangOpts), Tokens(std::move(Tokens)) {
				}

				const clang::syntax::TokenBuffer &syntax::Corpus::tokenBuffer() const {
				return Tokens;
				}

				std::pair<FileID, llvm::ArrayRef<syntax::Token>>
				syntax::Corpus::tokenizeBuffer(std::unique_ptr<llvm::MemoryBuffer> Input) {
				auto FID = SourceMgr.createFileID(std::move(Input));
				auto It = ExtraTokens.try_emplace(FID, tokenize(FID, SourceMgr, LangOpts));
				assert(It.second && "duplicate FileID");
				return {FID, It.first->second};
				}

clang/lib/Tooling/Syntax/Nodes.cpp

This file was added.

				//===- Nodes.cpp ----------------------------------------------*- C++ -*-=====//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				#include "clang/Tooling/Syntax/Tree/Nodes.h"
				#include "clang/Basic/TokenKinds.h"
				#include "clang/Tooling/Syntax/Tree/NodeKind.h"

				using namespace clang;

				llvm::StringRef syntax::toString(syntax::NodeKind K) {
				switch (K) {
				case NodeKind::TranslationUnit:
				return "translation-unit";
				case NodeKind::RecoveryNode:
				return "recovery-node";
				case NodeKind::Leaf:
				return "tokens";
				case NodeKind::CompoundStatement:
				return "compound-statment";
				case NodeKind::TopLevelDecl:
				return "top-level-decl";
				}
				llvm_unreachable("invalid NodeKind");
				}

				syntax::Leaf *syntax::CompoundStatement::lbrace() {
				return findLeaf(tok::l_brace);
				}

				syntax::Leaf *syntax::CompoundStatement::rbrace() {
				return findLeaf(tok::r_brace);
				}

clang/lib/Tooling/Syntax/Tree.cpp

This file was added.

				//===- Tree.cpp - mutable syntax trees ------------------------*- C++ -*-=====//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				#include "clang/Tooling/Syntax/Tree.h"
				#include "clang/Tooling/Syntax/Tokens.h"
				#include "clang/Tooling/Syntax/Tree/Cascade.h"
				#include "clang/Tooling/Syntax/Tree/Nodes.h"
				#include "llvm/ADT/ArrayRef.h"
				#include "llvm/ADT/STLExtras.h"
				#include "llvm/Support/Casting.h"

				using namespace clang;

				void syntax::traverse(Node *N, llvm::function_ref<void(Node *)> Visit) {
				if (auto *T = llvm::dyn_cast<TreeNode>(N)) {
				for (auto *C : T->children())
				traverse(C, Visit);
				}
				Visit(N);
				}

				void syntax::traverse(const Node *N,
				llvm::function_ref<void(const Node *)> Visit) {
				return traverse(const_cast<Node *>(N),
				static_cast<llvm::function_ref<void(Node *)>>(Visit));
				}
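
The inline review on Tree.h asks for a way to skip the children of a node and to abort the traversal entirely. The traverse() above visits children before the node itself, so a visitor cannot decide to skip them. A pre-order walk whose callback returns an action is one possible shape for such an API. The sketch below is hypothetical and is not part of this patch: `WalkNode`, `WalkAction` and `walkPreOrder` are names invented for illustration.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal tree node: a label plus owned children.
struct WalkNode {
  std::string Label;
  std::vector<std::unique_ptr<WalkNode>> Children;
};

// What the visitor wants the walk to do after seeing a node.
enum class WalkAction { Continue, SkipChildren, Abort };

// Visits N before its children. Returns false if the walk was aborted.
bool walkPreOrder(const WalkNode &N,
                  const std::function<WalkAction(const WalkNode &)> &Visit) {
  switch (Visit(N)) {
  case WalkAction::Abort:
    return false;
  case WalkAction::SkipChildren:
    return true; // Keep walking siblings, but do not descend into N.
  case WalkAction::Continue:
    break;
  }
  for (const auto &C : N.Children)
    if (!walkPreOrder(*C, Visit))
      return false; // Propagate the abort up to the caller.
  return true;
}
```

The boolean result lets an aborted walk unwind through every level of recursion without exceptions.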

clang/tools/CMakeLists.txt

	create_subdirectory_options(CLANG TOOL)			create_subdirectory_options(CLANG TOOL)

	add_clang_subdirectory(diagtool)			add_clang_subdirectory(diagtool)
	add_clang_subdirectory(driver)			add_clang_subdirectory(driver)
	add_clang_subdirectory(clang-diff)			add_clang_subdirectory(clang-diff)
	add_clang_subdirectory(clang-format)			add_clang_subdirectory(clang-format)
	add_clang_subdirectory(clang-format-vs)			add_clang_subdirectory(clang-format-vs)
	add_clang_subdirectory(clang-fuzzer)			add_clang_subdirectory(clang-fuzzer)
	add_clang_subdirectory(clang-import-test)			add_clang_subdirectory(clang-import-test)
	add_clang_subdirectory(clang-offload-bundler)			add_clang_subdirectory(clang-offload-bundler)

	add_clang_subdirectory(c-index-test)			add_clang_subdirectory(c-index-test)

	add_clang_subdirectory(clang-rename)			add_clang_subdirectory(clang-rename)
	add_clang_subdirectory(clang-refactor)			add_clang_subdirectory(clang-refactor)
				add_clang_subdirectory(clang-syntax)

	if(CLANG_ENABLE_ARCMT)			if(CLANG_ENABLE_ARCMT)
	add_clang_subdirectory(arcmt-test)			add_clang_subdirectory(arcmt-test)
	add_clang_subdirectory(c-arcmt-test)			add_clang_subdirectory(c-arcmt-test)
	endif()			endif()

	if(CLANG_ENABLE_STATIC_ANALYZER)			if(CLANG_ENABLE_STATIC_ANALYZER)
	add_clang_subdirectory(clang-check)			add_clang_subdirectory(clang-check)
	Show All 14 Lines

clang/unittests/Tooling/Syntax/CMakeLists.txt

	set(LLVM_LINK_COMPONENTS			set(LLVM_LINK_COMPONENTS
	${LLVM_TARGETS_TO_BUILD}			${LLVM_TARGETS_TO_BUILD}
	Support			Support
	)			)

	add_clang_unittest(TokensTest			add_clang_unittest(SyntaxTest
				TreeTest.cpp
	TokensTest.cpp			TokensTest.cpp
	)			)

	target_link_libraries(TokensTest			target_link_libraries(SyntaxTest
	PRIVATE			PRIVATE
	clangAST			clangAST
	clangBasic			clangBasic
	clangFrontend			clangFrontend
	clangLex			clangLex
	clangSerialization			clangSerialization
	clangTooling			clangTooling
	clangToolingSyntax			clangToolingSyntax
	LLVMTestingSupport			LLVMTestingSupport
	)			)

clang/unittests/Tooling/Syntax/TreeTest.cpp

This file was added.

				//===- TreeTest.cpp -------------------------------------------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "clang/Tooling/Syntax/Tree.h"
				#include "clang/AST/ASTConsumer.h"
				#include "clang/AST/Decl.h"
				#include "clang/Frontend/CompilerInstance.h"
				#include "clang/Frontend/FrontendAction.h"
				#include "clang/Lex/PreprocessorOptions.h"
				#include "clang/Tooling/Syntax/Corpus.h"
				#include "clang/Tooling/Syntax/Tokens.h"
				#include "clang/Tooling/Syntax/Tree/Nodes.h"
				#include "clang/Tooling/Tooling.h"
				#include "llvm/ADT/STLExtras.h"
				#include "llvm/ADT/StringRef.h"
				#include "gmock/gmock.h"
				#include "gtest/gtest.h"
				#include <cstdlib>

				using namespace clang;

				namespace {
				class SyntaxTreeTest : public ::testing::Test {
				protected:
				// Build a syntax tree for the code.
				syntax::TranslationUnit *buildTree(llvm::StringRef Code) {
				// FIXME: this code is almost identical to the one in TokensTest. Share
				// it.
				class BuildSyntaxTree : public ASTConsumer {
				public:
				BuildSyntaxTree(syntax::TranslationUnit *&Root,
				std::unique_ptr<syntax::Corpus> &Corpus,
				std::unique_ptr<syntax::TokenCollector> Tokens)
				: Root(Root), Corpus(Corpus), Tokens(std::move(Tokens)) {
				assert(this->Tokens);
				}

				void HandleTranslationUnit(ASTContext &Ctx) override {
				Corpus = llvm::make_unique<syntax::Corpus>(
				Ctx.getSourceManager(), Ctx.getLangOpts(),
				std::move(*Tokens).consume());
				Tokens = nullptr; // make sure we fail if this gets called twice.
				Root = syntax::buildSyntaxTree(*Corpus, *Ctx.getTranslationUnitDecl());
				}

				private:
				syntax::TranslationUnit *&Root;
				std::unique_ptr<syntax::Corpus> &Corpus;
				std::unique_ptr<syntax::TokenCollector> Tokens;
				};

				class BuildSyntaxTreeAction : public ASTFrontendAction {
				public:
				BuildSyntaxTreeAction(syntax::TranslationUnit *&Root,
				std::unique_ptr<syntax::Corpus> &Corpus)
				: Root(Root), Corpus(Corpus) {}

				std::unique_ptr<ASTConsumer>
				CreateASTConsumer(CompilerInstance &CI, StringRef InFile) override {
				// Start recording the tokens; the AST consumer takes ownership of the result.
				auto Tokens =
				llvm::make_unique<syntax::TokenCollector>(CI.getPreprocessor());
				return llvm::make_unique<BuildSyntaxTree>(Root, Corpus,
				std::move(Tokens));
				}

				private:
				syntax::TranslationUnit *&Root;
				std::unique_ptr<syntax::Corpus> &Corpus;
				};

				constexpr const char *FileName = "./input.cpp";
				FS->addFile(FileName, time_t(), llvm::MemoryBuffer::getMemBufferCopy(""));
				// Prepare to run a compiler.
				std::vector<const char *> Args = {"tok-test", "-std=c++03", "-fsyntax-only",
				FileName};
				auto CI = createInvocationFromCommandLine(Args, Diags, FS);
				assert(CI);
				CI->getFrontendOpts().DisableFree = false;
				CI->getPreprocessorOpts().addRemappedFile(
				FileName, llvm::MemoryBuffer::getMemBufferCopy(Code).release());
				CompilerInstance Compiler;
				Compiler.setInvocation(std::move(CI));
				if (!Diags->getClient())
				Diags->setClient(new IgnoringDiagConsumer);
				Compiler.setDiagnostics(Diags.get());
				Compiler.setFileManager(FileMgr.get());
				Compiler.setSourceManager(SourceMgr.get());

				syntax::TranslationUnit *Root = nullptr;
				BuildSyntaxTreeAction Recorder(Root, this->Corpus);
				if (!Compiler.ExecuteAction(Recorder)) {
				ADD_FAILURE() << "failed to run the frontend";
				std::abort();
				}
				return Root;
				}

				// Adds a file to the test VFS.
				void addFile(llvm::StringRef Path, llvm::StringRef Contents) {
				if (!FS->addFile(Path, time_t(),
				llvm::MemoryBuffer::getMemBufferCopy(Contents))) {
				ADD_FAILURE() << "could not add a file to VFS: " << Path;
				}
				}

				// Data fields.
				llvm::IntrusiveRefCntPtr<DiagnosticsEngine> Diags =
				new DiagnosticsEngine(new DiagnosticIDs, new DiagnosticOptions);
				IntrusiveRefCntPtr<llvm::vfs::InMemoryFileSystem> FS =
				new llvm::vfs::InMemoryFileSystem;
				llvm::IntrusiveRefCntPtr<FileManager> FileMgr =
				new FileManager(FileSystemOptions(), FS);
				llvm::IntrusiveRefCntPtr<SourceManager> SourceMgr =
				new SourceManager(Diags, FileMgr);
				// Set after calling buildTree().
				std::unique_ptr<syntax::Corpus> Corpus;
				};

				TEST_F(SyntaxTreeTest, Basic) {
				std::pair</*Input*/ std::string, /*Expected*/ std::string> Cases[] = {{
				R"cpp(
				int main() {}
				void foo() {}
				)cpp",
				R"txt(
				translation-unit
				|-top-level-decl
				| |-recovery-node
				| | `-int main ( )
				| `-compound-statment
				| |-{
				| `-}
				|-top-level-decl
				| |-recovery-node
				| | `-void foo ( )
				| `-compound-statment
				| |-{
				| `-}
				`-<eof>
				)txt"}};

				for (const auto &T : Cases) {
				auto *Root = buildTree(T.first);
				std::string Expected = llvm::StringRef(T.second).trim().str();
				std::string Actual = llvm::StringRef(Root->dump(*Corpus)).trim();
				EXPECT_EQ(Expected, Actual) << "the resulting dump is:\n" << Actual;
				}
				}
				} // namespace

[Syntax] Introduce syntax trees (Abandoned, Public)


Diff 198427
