This is an archive of the discontinued LLVM Phabricator instance.

Differential D121150

[pseudo] Implement the GLR parsing algorithm.
ClosedPublic

Authored by hokein on Mar 7 2022, 12:54 PM.

Details

Reviewers

sammccall

Commits

rG9f38da258ea7: [pseudo] Implement the GLR parsing algorithm.
rGeac22d0754f7: [pseudo] Implement the GLR parsing algorithm.

Summary

This patch implements a standard GLR parsing algorithm, the core piece of the pseudoparser:

it parses preprocessed C++ code; currently it supports only correct code and parses it as a translation-unit;
it produces a forest that stores all possible trees efficiently (only a single node is built per (SymbolID, Token Range)); no disambiguation yet.
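
The "single node per (SymbolID, Token Range)" property can be illustrated with a small interning table. This is only a sketch under stated assumptions: `SharedForest`, `Node`, `TokenIndex`, and the `std::map` lookup are hypothetical names chosen for illustration, and the patch itself allocates nodes from an arena rather than looking them up in a map.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <tuple>

// Hypothetical stand-ins for the patch's SymbolID and Token::Index.
using SymbolID = uint16_t;
using TokenIndex = uint32_t;

// One node per (symbol, token range); the range is half-open: [Begin, End).
struct Node {
  SymbolID Symbol;
  TokenIndex Begin;
  TokenIndex End;
};

class SharedForest {
public:
  // Returns the unique node for this key, creating it on first request.
  const Node &get(SymbolID Symbol, TokenIndex Begin, TokenIndex End) {
    assert(Begin <= End && "token range must not be reversed");
    auto Key = std::make_tuple(Symbol, Begin, End);
    auto It = Nodes.find(Key);
    if (It == Nodes.end())
      It = Nodes.emplace(Key, std::make_unique<Node>(Node{Symbol, Begin, End}))
               .first;
    return *It->second;
  }

  std::size_t size() const { return Nodes.size(); }

private:
  std::map<std::tuple<SymbolID, TokenIndex, TokenIndex>, std::unique_ptr<Node>>
      Nodes;
};
```

Two parses that both interpret the same token range as the same symbol receive the same node, so any work done below that node is shared rather than repeated.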

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

hokein created this revision. Mar 7 2022, 12:54 PM

Herald added a project: Restricted Project. Mar 7 2022, 12:54 PM

Herald added subscribers: mgrang, mgorny.

hokein requested review of this revision. Mar 7 2022, 12:54 PM

Herald added a project: Restricted Project. Mar 7 2022, 12:54 PM

Herald added a subscriber: alextsao1999.

This is an initial version with no unit tests yet, but it should be good enough for a high-level review.

Basically, it contains all key pieces of the GLR parser:

Forest: it is a DAG with compact nodes (8 bytes per node). It is immutable by design. The code is mostly derived from our prototype, and might need some improvements.

Graph-structured stack: the GSS is a simple implementation that only allocates new nodes and never deallocates dead ones. We should figure out whether it is worth the effort to implement smarter deallocation. The GSS only affects peak memory usage during parsing; it can be thrown away after we build the forest. In addition, the forest node is stored in the graph node rather than in the edge (per our discussion, storing forest nodes in edges felt more natural and a better fit for our mental model, but I found the implementation awkward and finally abandoned it).

Core GLR parsing algorithm: it should be in a good state for review. It does not yet produce ambiguous forest nodes: because the forest is immutable, this requires careful ordering when performing reduce actions, and the implementation is tricky and adds complexity. I plan to do it in a follow-up patch.
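
The graph-structured stack described above (allocate-only, with the forest node stored on the graph node) can be sketched roughly as follows. The names `GSS`, `GSSNode`, and `addNode` are illustrative assumptions, and a `std::deque` stands in for the arena allocation the patch uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

struct ForestNode; // Opaque here; only pointers to it are stored.

struct GSSNode {
  uint16_t State;                       // LR state at this stack position.
  const ForestNode *Parsed;             // Kept on the node, not on an edge.
  std::vector<const GSSNode *> Parents; // Several parents mean a join.
};

class GSS {
public:
  // Nodes are only ever added; nothing is freed until the GSS is destroyed.
  const GSSNode *addNode(uint16_t State, const ForestNode *Parsed,
                         std::vector<const GSSNode *> Parents) {
    assert(State < (1u << 13) && "state is assumed to fit in 13 bits");
    Nodes.push_back(GSSNode{State, Parsed, std::move(Parents)});
    return &Nodes.back();
  }

  std::size_t nodeCount() const { return Nodes.size(); }

private:
  // std::deque keeps existing elements at stable addresses across push_back.
  std::deque<GSSNode> Nodes;
};
```

A fork is two nodes added with the same parent; a join is one node added with two parents. Dead heads simply stay allocated, which is why the GSS only matters for peak memory.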

Some misc questions:

is the current output of the forest ok? (I think in general it is ok)
Any ideas about testing & debugging? Verifying the output forest is one way to test the GLR parser, but it is not ideal if we want to inspect the internal states of the parser during parsing. The alternative is to define a logger interface that the parser invokes at certain points, so that we can inject observers into the parser and use the log messages for testing and debugging (see the LLVM_DEBUG usage in GLRParser.cpp).
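
The logger-interface idea could look something like the sketch below. Everything here is hypothetical (`ParseObserver`, `RecordingObserver`, and the toy `shiftAll` driver are not part of the patch, which relies on LLVM_DEBUG instead); it only shows how an injected observer would let a test assert on parser events.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical hook interface the parser would call at interesting points.
class ParseObserver {
public:
  virtual ~ParseObserver() = default;
  virtual void onShift(unsigned TokenIndex, unsigned NewState) = 0;
  virtual void onReduce(unsigned RuleID) = 0;
};

// Test-side observer that records events as strings.
class RecordingObserver : public ParseObserver {
public:
  void onShift(unsigned TokenIndex, unsigned NewState) override {
    Events.push_back("shift tok=" + std::to_string(TokenIndex) +
                     " state=" + std::to_string(NewState));
  }
  void onReduce(unsigned RuleID) override {
    Events.push_back("reduce rule=" + std::to_string(RuleID));
  }
  std::vector<std::string> Events;
};

// Toy stand-in for the parser loop: "shifts" each token into state Index + 1
// and reports it. A null observer means no logging.
inline void shiftAll(unsigned NumTokens, ParseObserver *Observer) {
  for (unsigned Index = 0; Index < NumTokens; ++Index) {
    unsigned NewState = Index + 1;
    assert(NewState > Index && "state counter must not wrap");
    if (Observer)
      Observer->onShift(Index, NewState);
  }
}
```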

some tweaks.

Harbormaster completed remote builds in B153005: Diff 413609. Mar 7 2022, 1:39 PM

Nice! Some early comments, I haven't gotten deep into the GLR algorithm itself

clang/include/clang/Tooling/Syntax/Pseudo/Forest.h
9	nit: i think we should reverse the emphasis: the one-sentence summary should say what this models (a set of possible parse trees) and then elaborate below that it's the output of GLR
12	I think we should be a bit more verbose than "forest is a dag" since this is very confusing/surprising (at least to me). Maybe: A parse forest represents a set of possible parse trees. Despite the name, the data structure is a parse-tree-like DAG with a single root. Multiple ways to parse the same tokens are represented as an Ambiguous node with the possible interpretations as children. Common sub-parses are shared: if two interpretations both parse the token range "1 + 1" as expr := expr + expr, they will share a Sequence node representing this.
13	nit: numberous -> numerous
13	wording is a bit misleading here:
	- suggests that the primary virtue is being space efficient when really it eliminates other forms of redundancy too (like traversal)
	- calling these "subtrees" is a bit confusing since they're not trees (but represent sets of trees)
29	We should say something more than this about node structure:
	- nodes represent ways to interpret a sequence of tokens
	- a node always interprets a fixed set of tokens as a fixed symbol kind
	- some nodes may have children, and have pointers to them
	- all nodes may have multiple parents, but do not know them
30	I wonder if we should put these in a namespace `forest::Node` rather than giving them the "Forest" prefix?
46	this sentence doesn't make sense
57	maybe startIndex or startToken? Loc reminds me too much of SourceLocation...

sammccall added inline comments. Mar 7 2022, 4:00 PM

clang/include/clang/Tooling/Syntax/Pseudo/Forest.h
111	it seems weird to have this state... can we replace init + terminal() with `ArrayRef<ForestNode> createTerminals(const TokenStream&)` and have the parser responsible for keeping track of pointers (as it does for other nodes)? This would also make this fit the pure "arena" concept better. I'm digging through notes on the prototype and I'm not convinced that my reasons for keeping track of these in the arena make sense. (They were about ensuring equivalent nodes are shared between heuristic vs grammar parser, but I'm not sure why this only needs to be done for terminals, and there are other ways to do it)
clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h
114	this function never exposes an interesting graph, as it's mostly empty before + after parse(). It's only really useful for working out how much memory is allocated. If we were to drop this, the public interface of GLRParser would be a single function (so it need not be a public class) and Graph would disappear entirely! With this in mind, it seems like we could replace this header file with something like:
	  struct ParseStats { unsigned GSSBytes = 0; };
	  ForestNode glrParse(const TokenStream &, const LRTable&, const Grammar&, ForestArena&, ParseStats *Stats = nullptr);
	It feels like hiding the GSS might be an obstacle to debugging/testing. However this really would need a story/API around how to interrupt the parser. LLVM_DEBUG seems reasonable to me for now.
clang/lib/Tooling/Syntax/Pseudo/Forest.cpp
55	commented out code
74	a few of these comments have the wrong case on the names
clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp
33	It actually currently looks like it would be at least as natural to have the parser operate on the sequence of terminal ForestNodes rather than on the Tokens themselves...
64	Returning null when the input doesn't match the grammar is not at all what we want. I know it's early, but I'm worried we're defining an API for the GLR parser that matches the textbook rather than the one we need. (Similar to concerns about the Accept action - we're going to accept any stream of tokens).
	I think we need to plan a little bit how error recovery is going to work here:
	- do we plan to call this recursively inside brackets, or handle lazy-brackets in the GLR parser itself?
	- will sequence recovery (after `,` or between declarations) happen in the GLR parser or externally? in this top loop? how will parsing continue?
	- where would splitting heuristics happen (like interpreting `x? = y?` as an assignment expression, or `foo(; bar;` as two statements)
	- in the worst case, when we're parsing something with no sequence recovery (say an expression) and we don't match the grammar, no splitting rule applies etc... are we going to create an opaque node? Does this function do it or the caller?
	I suspect it would be a better fit to:
	a) have an Opaque node, if not in this patch then very soon
	b) operate on an ArrayRef<Token> or ArrayRef<ForestNode>, and stop relying on eof to delimit (so we can parse subranges if needed)
	c) stop relying on Accept actions, since we'll want to look at the stack when we run out of tokens in general, and Accept just models a trivial case of that

fix a subtle bug where we might leave an unremoved node in the reduce path.

Harbormaster completed remote builds in B153142: Diff 413788. Mar 8 2022, 6:59 AM

sammccall added inline comments. Mar 8 2022, 8:02 AM

clang/include/clang/Tooling/Syntax/Pseudo/Forest.h
30	alignas(ForestNode*)
67	dump(), dumpRecursive()
78	Rule, Elements
90	remove trailing _
91	ForestNode*
100	count()?
clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h
40	the fact that this uses a DAG seems less important than what it represents. I think the next sentence is a better introduction, and then we should describe what data the stacks are modelling, finally we should describe the data structure after (see comment below).
43	as with forest, I think this places too much emphasis on the memory-compactness when it's not the main benefit. (If RAM were free we still wouldn't want a big array of stacks).
43	To address the two above comments, maybe something like:
	  A Graph-Structured Stack represents the multiple parse stacks for a generalized LR parser.
	  Each node stores a parse state, the last parsed ForestNode, and its parent(s).
	  There may be several heads (top of stack), and the parser operates by:
	  - shift: pushing terminal symbols on top of the stack
	  - reduce: replace N symbols on top of the stack with one nonterminal
	  The structure is a DAG rather than a stack:
	  - GLR allows multiple actions on the same head, producing forks (nodes with the same parent).
	  - The parser reconciles nodes with the same (state, ForestNode), producing joins (node with multiple parents).
	  The parser is responsible for creating nodes, and keeping track of the set of heads.
46	combing -> combining
	I wonder if combining is the right explanation here - unlike local ambiguity packing, it's not like we produce two things and then merge them. What about rather: `sharing stack prefixes: when the parser must take two conflicting paths, it creates two new stack head nodes with the same parent.`
48	combing -> combining, euqal -> equal
48	the "-- as..." clause is a bit confusing and doesn't seem necessary. I think what's missing here is a high level explanation of what's going on in the parse to trigger this, and explicitly mentioning this is how we get a node with multiple parents. When alternative parses converge on the same interpretation of later tokens, their stack heads end up in the same state. These are merged, resulting in a single head with several parents.
53	this shows forking but not merging and in particular it suggests that #heads == #stacks, which is not the case
54	arrows are pointing the wrong way (at least opposite to our pointers)
58	the name "Graph" is IMO too general, this isn't a reusable graph or even dag class. I think GSS is fine, it's a distinctive name from the literature, and the expansion of the abbreviation is nice and descriptive (but too verbose to be the actual name)
59	this comment doesn't say anything, remove?
60	alignas(Node) (in practice this will match ForestNode so is a no-op, but it documents the intent)
61	again, drop comment unless there's something to say
64	This is not what predecessor usually means in graph theory: u is the predecessor of v if there is some path from u->v. I think "parent" is a common and well-understood term.
68	Also not sure what this comment says:
	- first line just repeats the type: type is a forest node, forest nodes are always for symbols, and terminal/nonterminal are all the possibilities
	- second line is either referring to or defining (not sure) edge labels, but edge labels aren't defined or referred to anywhere else, so why?
	Maybe: The parse tree for the last symbol we parsed. This symbol appears on the left of the dot in the parse state.
77	why is Parsed not part of the identity? (maybe there's a good reason, but there should be a comment)
87	8 parents isn't that many and it seems like this might be a dynamic property of the input code rather than a static property of the grammar. But I don't think this bitpacking is buying anything, it looks like the layout is:
	  State : 13
	  PredecessorCount : 3
	  (padding) : 48
	  Parsed : 64
	So I think we might as well just use a uint16 PredecessorCount and still have room left over for a uint16 refcount later. Is there anything else we might usefully store in the extra space?
100	name consistently with forest arena
112	as discussed offline, the signature here should involve a token range, or likely an ArrayRef<ForestNode> for the relevant terminals. Probably a FIXME to allow the start symbol to be specified.
114	After offline discussion I think we either want:
	- to hide GSS as mentioned above, and just test the forest output
	- expose GSS class, and have two versions of the parse function: one that finalizes the GSS into a result ForestNode, and a "raw" one that just returns the GSS. Then we can get at the GSS state at a point by running the raw parser on some prefix of the code. (Or even only have the raw one and make it the caller's responsibility to extract the node from the GSS, but this is probably silly)
129	I love the name frontier! Unfortunately the word refers to the whole boundary/set, rather than individual elements of it. Maybe this?
	  /// The list of actions we're ready to perform.
	  struct {
	    std::vector<Pending> Shift;
	    std::vector<Pending> Reduce;
	    std::vector<Pending> Accept;
	    bool empty();
	    // accept is an odd-one-out here, but I think it's going away?
	  } Frontier;
	Also maybe a FIXME that the frontier needs to order reductions so that local ambiguity packing is guaranteed to work as a single pass
133	just Action?
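
The layout argument in the comment on line 87 can be checked with a small sketch. The struct and field names below are illustrative only, bitfield layout is implementation-defined, and the "costs nothing" claim assumes a typical ABI with 64-bit pointers.

```cpp
#include <cassert>
#include <cstdint>

struct ForestNode; // Opaque here.

// Layout described in the review: 13-bit state and 3-bit count share 16 bits,
// followed by padding up to the pointer's alignment.
struct PackedNode {
  uint16_t State : 13;
  uint16_t PredecessorCount : 3;
  const ForestNode *Parsed;
};

// Suggested alternative: full 16-bit counters that reuse the padding bytes.
struct PlainNode {
  uint16_t State;
  uint16_t ParentCount;
  uint16_t RefCount;
  const ForestNode *Parsed;
};

// True when dropping the bitfields costs no extra bytes on this platform.
constexpr bool plainLayoutIsFree() {
  return sizeof(PlainNode) <= sizeof(PackedNode);
}

// Builds a node with more parents than a 3-bit counter could hold.
inline PlainNode makePlainNode(uint16_t State, uint16_t ParentCount) {
  assert(State < (1u << 13) && "state is assumed to fit in 13 bits");
  return PlainNode{State, ParentCount, 0, nullptr};
}
```

With 64-bit pointers both structs are expected to occupy 16 bytes, so the wider counters come out of what was padding.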
clang/lib/Tooling/Syntax/Pseudo/Forest.cpp
30	unhandled
clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp
33	I'd suggest calling this NextTok instead of Lookahead:
	- in common english this is a verb rather than a noun
	- in parsers it more often refers to the number of tokens in the tables than the tokens themselves
	- referring to the token you're currently shifting as "ahead" is bizarre!
67	if this is a hot loop we shouldn't be creating std::vectors and throwing them away
80	we shift tokens, not states: the token conceptually moves from the input to the stack. (This seems to be other references not just me!)
	if multiple stack heads will reach the same state after shifting a token?
85	I think the arrow directions aren't particularly clear (our pointers go the opposite way) or necessary here, and backticks and up arrows are hard to read. A couple of (widely-supported) box-drawing characters would help a lot I think:
	  0--1--2
	     └--3

	  0--1--2--4
	     └--3--┘
	(I would keep using `-` and `|` rather than making everything pretty boxes, it's readable enough and easier to edit. We may want to consider box-drawing characters for dump functions though...)
93	I can't parse `a "perform" shift`. Batch shifts by target state so we can merge matching groups?
97	I don't think tiebreaking by `Head` achieves anything. If we really need to ensure deterministic behavior, comparing pointers won't do that as allocation may vary. On the other hand, `stable_sort` should work: we'll tiebreak by order enqueued, so we're deterministic if our inputs are.
108	SmallVector
110	llvm::is_contained
114	if this turns out to be hot, you could avoid the temporary array of predecessors: return a mutable node from addNode
128	name doesn't match return type
131	I suspect `S{0}` will be verbose enough, it's not likely spelling out `state` will make this much less cryptic, but it may be harder to scan
141	We're leaning on std::function a lot for code that's supposedly in the hot path. It's probably fine, but I'd like to see this written more directly. As far as I can tell, this could easily be a class, with enumerateReducePath a (recursive) method that concretely calls into handleReducePath() or something at the bottom. Also if it's a class we can easily reset() instead of destroying it if we want to reuse its vectors across separate reductions.
158	(just a note, I need to dig into this part tomorrow!)
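
The suggestion on line 141 (replace the std::function plumbing with a class whose recursive method walks reduce paths) might be sketched as follows. `ReducePathEnumerator` and the simplified `GSSNode` are hypothetical names; the real reduction also builds forest nodes and performs goto actions, which this sketch leaves out.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified GSS node: just a state and parent links.
struct GSSNode {
  uint16_t State;
  std::vector<const GSSNode *> Parents;
};

// Enumerates every chain of `Length` nodes reachable from a head by following
// parent links. Written as a class so its vectors are reused between runs.
class ReducePathEnumerator {
public:
  using Path = std::vector<const GSSNode *>;

  // The returned reference is only valid until the next call to run().
  const std::vector<Path> &run(const GSSNode *Head, unsigned Length) {
    assert(Head && "need a stack head to start from");
    Paths.clear();
    Current.clear();
    if (Length == 0)
      Paths.push_back(Current); // An empty rule pops nothing: one empty path.
    else
      visit(Head, Length);
    return Paths;
  }

private:
  void visit(const GSSNode *N, unsigned Remaining) {
    Current.push_back(N);
    if (Remaining == 1) {
      Paths.push_back(Current); // Popped enough nodes; record this path.
    } else {
      for (const GSSNode *Parent : N->Parents)
        visit(Parent, Remaining - 1);
    }
    Current.pop_back();
  }

  std::vector<Path> Paths;
  Path Current;
};
```

Because a joined head has several parents, a single reduce over N symbols can yield several paths; each one would correspond to a separate reduction.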
clang/tools/clang-pseudo/ClangPseudo.cpp
33	nit: a C++ source file => a source file
33	Why is this a new flag independent of `Source`? It also seems inconsistent with other flags, which describe what output is requested rather than what computation should be done. Maybe "-print-forest" and "-print-stats"? (which could also potentially control stats of other stages)

hokein mentioned this in D121368: [pseudo][WIP] Build Ambiguous forest node in the GLR Parser. Mar 10 2022, 3:07 AM

hokein mentioned this in D122139: [pseudo] Introduce parse forest. Mar 21 2022, 7:39 AM

splitting the forest data structure to https://reviews.llvm.org/D122139, comments around Forest.h/.cpp should be addressed.

clang/include/clang/Tooling/Syntax/Pseudo/Forest.h
57	renamed to `startTokenIndex`.

hokein mentioned this in D122303: [pseudo] Sort nonterminals based on their reduction order. Mar 23 2022, 3:57 AM

hokein mentioned this in rGf383b88d8214: [pseudo] Sort nonterminals based on their reduction order. Mar 24 2022, 6:31 AM

hokein mentioned this in rG62d5f254ccd0: [pseudo] Introduce parse forest. Mar 24 2022, 6:49 AM

sammccall mentioned this in D122408: [pseudo] [WIP2] Implement GLR parser. Mar 24 2022, 9:15 AM

Updates:

a derived version of D122408 and D121368;
refine the APIs, getting rid of the GLRParser class and providing fine-grained pieces that make tests easier to write;
add unittests for the algorithm and a simple smoke lit test;
when we fail to parse the input, we return an opaque forest node rather than a nullptr;
rebase to the main branch;

Herald added a project: Restricted Project. May 2 2022, 12:35 PM

Fix the bad format from lint.

Harbormaster completed remote builds in B162307: Diff 426493. May 2 2022, 1:17 PM

This looks really good, great job on the signatures & tests.
There are a few different ways to formulate the signatures of glrParse/glrReduce, and some possible optimizations, but I can't see anything that's an obvious improvement worth holding up over.

clang-tools-extra/pseudo/include/clang-pseudo/GLR.h
71 ↗	(On Diff #426493)	nit: pointers are stored as trailing objects, not the parents themselves
143 ↗	(On Diff #426493)	nit: this is always used synchronously, so llvm::function_ref?
146 ↗	(On Diff #426493)	the comment says newly-created heads are passed, but actually their inputs are passed and the callback is responsible for creating them. (Why not have glrShift create the node and pass it to the callback? Or maybe even pass the output vector of NewHeads& in for concreteness?)
147 ↗	(On Diff #426493)	Maybe also mention the interaction with PendingShift? "When this function returns, PendingShift is empty."?
151 ↗	(On Diff #426493)	nit: semantics of the callback aren't obvious, so I think "NewHead" is a better name than "CB"
153 ↗	(On Diff #426493)	When this function returns, PendingReduce is empty. Calls to NewHeadCB may add elements to PendingReduce
clang-tools-extra/pseudo/tool/ClangPseudo.cpp
39 ↗	(On Diff #426493)	nit: just "print statistics"? I think this should be orthogonal to other options
clang-tools-extra/pseudo/unittests/GLRTest.cpp
34 ↗	(On Diff #426493)	this looks so much like a GSS node: why not just use a GSS node?
137 ↗	(On Diff #426493)	bind is ugly :-( maybe just have GLRTest::captureNewHeads() return a std::function with the right signature?

This revision is now accepted and ready to land. May 3 2022, 4:02 AM

address remaining comments:

return the GSS::Node in the NewHeadCallback, rather than the fields of GSS::Node;
remove the unnecessary NewHeadResult structure in the unittest;

hokein retitled this revision from [pseudo][WIP] Implement a GLR parser. to [pseudo] Implement the GLR parsing algorithm. May 3 2022, 6:27 AM

hokein edited the summary of this revision.

hokein added inline comments. May 3 2022, 6:32 AM

clang-tools-extra/pseudo/include/clang-pseudo/GLR.h
143 ↗	(On Diff #426493)	we have a `captureNewHeads` method which returns this callback in unittest, returning llvm::function_ref is not safe -- we could make it return a std::function, but it would be nice to have a unified signature.
clang-tools-extra/pseudo/unittests/GLRTest.cpp
34 ↗	(On Diff #426493)	oh, right. I added this because I didn't expose the GSS in the `Params` in my previous version, I needed to store the Parents. Right now it is not needed, we can use the GSS::Node directly.
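
The lifetime concern in the reply on GLR.h line 143 can be illustrated without LLVM. `llvm::function_ref` is a non-owning reference to a callable, so returning one that refers to a lambda created inside the function would leave it dangling; `std::function` stores its own copy. The `HeadCollector` class below is a hypothetical stand-in for the test fixture, not the code from the patch.

```cpp
#include <cassert>
#include <functional>
#include <vector>

struct GSSNode {
  int State;
};

class HeadCollector {
public:
  // std::function owns a copy of the lambda, so the returned callback stays
  // valid after this function returns (as long as *this outlives it).
  std::function<void(const GSSNode *)> captureNewHeads() {
    return [this](const GSSNode *NewHead) {
      assert(NewHead && "callback expects a real node");
      Heads.push_back(NewHead);
    };
  }

  std::vector<const GSSNode *> Heads;
};
```

The cost of the owning wrapper is a possible allocation and an indirect call; a non-owning reference remains a reasonable choice for parameters that are only invoked before the callee returns.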

fix a dangling reference of the source text in clang-pseudo.

This revision was landed with ongoing or failed builds. May 3 2022, 6:43 AM

Closed by commit rGeac22d0754f7: [pseudo] Implement the GLR parsing algorithm. (authored by sammccall, committed by hokein).

This revision was automatically updated to reflect the committed changes.

hokein added a commit: rGeac22d0754f7: [pseudo] Implement the GLR parsing algorithm.

hokein added a reverting change: rG860eabb3953a: Revert "[pseudo] Implement the GLR parsing algorithm.". May 3 2022, 6:55 AM

Harbormaster completed remote builds in B162445: Diff 426680. May 3 2022, 7:16 AM

hokein added a commit: rG9f38da258ea7: [pseudo] Implement the GLR parsing algorithm. May 3 2022, 11:28 AM

Revision Contents

Path	Size
clang/include/clang/Tooling/Syntax/Pseudo/Forest.h	135 lines
clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h	148 lines
clang/lib/Tooling/Syntax/Pseudo/CMakeLists.txt	2 lines
clang/lib/Tooling/Syntax/Pseudo/Forest.cpp	125 lines
clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp	330 lines
clang/tools/clang-pseudo/ClangPseudo.cpp	23 lines

Diff 413609

clang/include/clang/Tooling/Syntax/Pseudo/Forest.h

This file was added.

				//===--- Forest.h - Parse forest, the output of the GLR parser ---*- C++-*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// Parse forest is the output of the GLR parser.
				sammccall (Done): nit: i think we should reverse the emphasis: the one-sentence summary should say what this models (a set of possible parse trees) and then elaborate below that it's the output of GLR
				//
				// For an ambiguous grammar, there might be multiple parse trees generated from
				// for the given input. Forest is a DAG which represent numberous possible in a
				sammccall (Done): I think we should be a bit more verbose than "forest is a dag" since this is very confusing/surprising (at least to me). Maybe: A parse forest represents a set of possible parse trees. Despite the name, the data structure is a parse-tree-like DAG with a single root. Multiple ways to parse the same tokens are represented as an Ambiguous node with the possible interpretations as children. Common sub-parses are shared: if two interpretations both parse the token range "1 + 1" as expr := expr + expr, they will share a Sequence node representing this.
				// space-efficient manner. Common subtrees are shared -- if two or more trees
				sammccall (Done): nit: numberous -> numerous
				sammccall (Done): wording is a bit misleading here:
				- suggests that the primary virtue is being space efficient when really it eliminates other forms of redundancy too (like traversal)
				- calling these "subtrees" is a bit confusing since they're not trees (but represent sets of trees)
				// treat the token range [1, 3) as an Expression, then there is a single shared
				// Expression node representing the subparse in the forest.
				//
				//===----------------------------------------------------------------------===//

				#include "clang/Tooling/Syntax/Pseudo/Grammar.h"
				#include "clang/Tooling/Syntax/Pseudo/Token.h"
				#include "llvm/ADT/ArrayRef.h"
				#include "llvm/Support/Allocator.h"
				#include <cstdint>

				namespace clang {
				namespace syntax {
				namespace pseudo {

				// A node in a forest.
				sammccall (Done): We should say something more than this about node structure:
				- nodes represent ways to interpret a sequence of tokens
				- a node always interprets a fixed set of tokens as a fixed symbol kind
				- some nodes may have children, and have pointers to them
				- all nodes may have multiple parents, but do not know them
				class ForestNode {
				sammccall (Not Done): I wonder if we should put these in a namespace `forest::Node` rather than giving them the "Forest" prefix?
				sammccall (Done): alignas(ForestNode*)
				public:
				enum Kind : uint8_t {
				// A Terminal node is a single terminal symbol bound to a token.
				Terminal,
				// A Sequence node is a nonterminal symbol parsed with a grammar rule.
				// elements() are the parses of each symbol on the RHS of the rule.
				Sequence,
				// An Ambiguous node exposes multiple ways to match the code to the symbol.
				// alternatives() are the possible parses, we should choose one.
				Ambiguous,
				};
				Kind kind() const { return K; }

				SymbolID symbol() const { return Symbol; }

				// The parses for each element in the RHS of the rule.
				sammccall (Done): this sentence doesn't make sense
				// REQUIRES: this is a Sequence node.
				RuleID rule() const {
				assert(kind() == Sequence);
				return Data_ & ((1 << RuleBits) - 1);
				}
				// REQUIRES: this is a Sequence node;
				llvm::ArrayRef<const ForestNode *> elements() const {
				assert(kind() == Sequence);
				return children(Data_ >> RuleBits);
				};
				uint32_t startLoc() const { return StartLoc; }
				sammccall (Done): maybe startIndex or startToken? Loc reminds me too much of SourceLocation...
				hokein (Author, Done): renamed to `startTokenIndex`.

				// The possible interpretations of the code.
				// REQUIRES: this is an Ambiguous node.
				llvm::ArrayRef<const ForestNode *> alternatives() const {
				assert(kind() == Ambiguous);
				return children(Data_);
				}

				std::string Dump(const Grammar &) const;
				std::string DumpRecursive(const Grammar &, bool abbreviated = false) const;
				sammccall (Done): dump(), dumpRecursive()

				private:
				friend class ForestArena;
				ForestNode(Kind K, SymbolID Symbol, Token::Index StartLoc, uint16_t Data)
				: StartLoc(StartLoc), K(K), Symbol(Symbol), Data_(Data) {}

				llvm::ArrayRef<const ForestNode *> children(uint16_t Num) const {
				return llvm::makeArrayRef(
				reinterpret_cast<const ForestNode *const *>(this + 1), Num);
				}
				static uint16_t SequenceData(RuleID rule,
				sammccall (Done): Rule, Elements
				llvm::ArrayRef<const ForestNode *> elements) {
				assert(rule < (1 << RuleBits));
				assert(elements.size() < (1 << (16 - RuleBits)));
				return rule | elements.size() << RuleBits;
				}
				Token::Index StartLoc;
				Kind K : 4;
				SymbolID Symbol : SymbolBits;
				// Sequences - child count : 4 | RuleID : 12
				// Ambiguous - child count : 16
				// Terminal - unused
				uint16_t Data_;
				sammccall (Done): remove trailing _
				// A trailing array of Node* .
				sammccall (Done): ForestNode*
				};

				// Node may not be destroyed (for BumpPtrAllocator).
				static_assert(std::is_trivially_destructible<ForestNode>(), "");

				// A memory arena for the parse forest.
				class ForestArena {
				public:
				size_t nodeNum() const { return NodeNum; }
				sammccall (Done): count()?
				size_t bytes() const { return Arena.getBytesAllocated() + sizeof(this); }

				ForestNode &createSequence(SymbolID SID, RuleID RID, Token::Index Start,
				llvm::ArrayRef<const ForestNode *> Elements) {
				return create(ForestNode::Sequence, SID, Start,
				ForestNode::SequenceData(RID, Elements), Elements);
				}

				void init(const TokenStream &Code);
				const ForestNode &terminal(Token::Index Index) const {
				assert(Terminals && "Terminals are not intialized!");
				sammccall (Not Done): it seems weird to have this state... can we replace init + terminal() with `ArrayRef<ForestNode> createTerminals(const TokenStream&)` and have the parser responsible for keeping track of pointers (as it does for other nodes)? This would also make this fit the pure "arena" concept better. I'm digging through notes on the prototype and I'm not convinced that my reasons for keeping track of these in the arena make sense. (They were about ensuring equivalent nodes are shared between heuristic vs grammar parser, but I'm not sure why this only needs to be done for terminals, and there are other ways to do it)
				return Terminals[Index];
				}

				private:
				ForestNode &create(ForestNode::Kind K, SymbolID SID, Token::Index Start,
				uint16_t Data,
				llvm::ArrayRef<const ForestNode *> Elements) {
				++NodeNum;
				ForestNode *New = new (Arena.Allocate(
				sizeof(ForestNode) + Elements.size() * sizeof(ForestNode *),
				alignof(ForestNode))) ForestNode(K, SID, Start, Data);
				if (!Elements.empty())
				llvm::copy(Elements, reinterpret_cast<const ForestNode **>(New + 1));
				return *New;
				}

				llvm::BumpPtrAllocator Arena;
				ForestNode *Terminals = nullptr;
				uint32_t NodeNum = 0;
				};

				} // namespace pseudo
				} // namespace syntax
				} // namespace clang
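
The 16-bit `Data` packing used by `SequenceData`, `rule()`, and `elements()` in the listing above can be exercised in isolation. This is a sketch, not the patch's code: `RuleBits = 12` is assumed from the "child count : 4 | RuleID : 12" layout comment (the real constant is defined elsewhere and not shown here), and the free-function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumed from the layout comment: the low 12 bits hold the RuleID and the
// high 4 bits hold the child count.
constexpr unsigned RuleBits = 12;

// Packs a rule ID and an element count into one 16-bit value.
inline uint16_t packSequenceData(unsigned Rule, unsigned NumElements) {
  assert(Rule < (1u << RuleBits) && "rule ID does not fit");
  assert(NumElements < (1u << (16 - RuleBits)) && "too many elements");
  return static_cast<uint16_t>(Rule | (NumElements << RuleBits));
}

// Recovers the rule ID from the low bits.
inline unsigned unpackRule(uint16_t Data) {
  return Data & ((1u << RuleBits) - 1);
}

// Recovers the element count from the high bits.
inline unsigned unpackElementCount(uint16_t Data) { return Data >> RuleBits; }
```

Under this assumption a Sequence node can reference at most 4096 rules and hold at most 15 children, which is what the two asserts in `SequenceData` guard.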

clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h

This file was added.

				//===--- GLRParser.h - Implement a standard GLR parser ----------*- C++-*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// This implements a standard Generalized LR (GLR) parsing algorithm.
				//
				// The GLR parser behaves as a normal LR parser until it encounters a conflict.
				// To handle a conflict (where there are multiple actions it could perform), the
				// parser will simulate nondeterminism by doing a breadth-first search
				// over all the possibilities.
				//
				// Basic mechanisms of the GLR parser:
				// - A number of processes are operated in parallel.
				// - Each process has its own parsing stack and behaves as a standard
				// deterministic LR parser.
				// - When a process encounters a conflict, it is forked (one for each
				// available action).
				// - When a process encounters an error, it is abandoned.
				// - All processes are synchronized by the lookahead token: they perform shift
				// actions at the same time, which means some processes need to wait until
				// other processes have performed all reduce actions.
				//
				//===----------------------------------------------------------------------===//

				#include "clang/Tooling/Syntax/Pseudo/Forest.h"
				#include "clang/Tooling/Syntax/Pseudo/Grammar.h"
				#include "clang/Tooling/Syntax/Pseudo/LRTable.h"
				#include "clang/Tooling/Syntax/Pseudo/Token.h"
				#include "llvm/Support/Allocator.h"
				#include <vector>

				namespace clang {
				namespace syntax {
				namespace pseudo {

				// An implementation of a directed acyclic graph (DAG), used as a
				sammccall: the fact that this uses a DAG seems less important than what it represents. I think the next sentence is a better introduction, and then we should describe what data the stacks are modelling, finally we should describe the data structure after (see comment below).
				// graph-structured stack (GSS) in the GLR parser.
				//
				// GSS is an efficient data structure to represent multiple active stacks, it
				sammccall: as with forest, I think this places too much emphasis on the memory-compactness when it's not the main benefit. (If RAM were free we still wouldn't want a big array of stacks).
				sammccall: To address the two above comments, maybe something like:
				A Graph-Structured Stack represents the multiple parse stacks for a generalized LR parser.
				Each node stores a parse state, the last parsed ForestNode, and its parent(s).
				There may be several heads (top of stack), and the parser operates by:
				- shift: pushing terminal symbols on top of the stack
				- reduce: replace N symbols on top of the stack with one nonterminal
				The structure is a DAG rather than a stack:
				- GLR allows multiple actions on the same head, producing forks (nodes with the same parent).
				- The parser reconciles nodes with the same (state, ForestNode), producing joins (node with multiple parents).
				The parser is responsible for creating nodes, and keeping track of the set of heads.
				// employs a stack-combination optimization to avoid potentially exponential
				// growth of the stack:
				// - combing equal stack prefixes -- A new stack doesn't need to have a full
				sammccall: combing -> combining
				I wonder if combining is the right explanation here - unlike local ambiguity packing, it's not like we produce two things and then merge them. What about rather: `sharing stack prefixes: when the parser must take two conflicting paths, it creates two new stack head nodes with the same parent.`
				// copy of its parent’s stack. They share a common prefix.
				// - combing euqal stack suffices -- as there are a finite number of DFA's
				sammccall: combing -> combining
				euqal -> equal
				sammccall: the "-- as..." clause is a bit confusing and doesn't seem necessary. I think what's missing here is a high level explanation of what's going on in the parse to trigger this, and explicitly mentioning this is how we get a node with multiple parents.
				When alternative parses converge on the same interpretation of later tokens, their stack heads end up in the same state. These are merged, resulting in a single head with several parents.
				// state the parser can be in. A set of heads can be in the same state
				// though they may have different parses, these heads can be merged,
				// resulting a single head.
				//
				// E.g. we have two active stacks:
				sammccall: this shows forking but not merging and in particular it suggests that #heads == #stacks, which is not the case
				// 0 -> 1 -> 2
				sammccall: arrows are pointing the wrong way (at least opposite to our pointers)
				//      |    ^ head1, representing a stack [2, 1, 0]
				//      ` -> 3
				//           ^ head2, representing a stack [3, 1, 0]
				struct Graph {
				sammccall: the name "Graph" is IMO too general, this isn't a reusable graph or even dag class. I think GSS is fine, it's a distinctive name from the literature, and the expansion of the abbreviation is nice and descriptive (but too verbose to be the actual name)
				// Represents a node in the graph.
				sammccall: this comment doesn't say anything, remove?
				struct Node {
				sammccall: alignas(Node) (in practice this will match ForestNode so is a no-op, but it documents the intent)
				// The parsing state presented by the graph node.
				sammccall: again, drop comment unless there's something to say
				LRTable::StateID State : LRTable::StateBits;
				static constexpr unsigned PredecessorBits = 3;
				// Number of the predecessors of the node.
				sammccall: This is not what predecessor usually means in graph theory: u is the predecessor of v if there is some path from u->v. I think "parent" is a common and well-understood term.
				// u is the predecessor of v, if u -> v.
				unsigned PredecessorCount : PredecessorBits;
				// The forest node for a terminal/nonterminal symbol.
				// The symbol corresponds to the label of the edges which lead to the current node
				sammccall: Also not sure what this comment says:
				- first line just repeats the type: type is a forest node, forest nodes are always for symbols, and terminal/nonterminal are all the possibilities
				- second line is either referring to or defining (not sure) edge labels, but edge labels aren't defined or referred to anywhere else, so why?
				Maybe: The parse tree for the last symbol we parsed. This symbol appears on the left of the dot in the parse state.
				// from the predecessor nodes.
				const ForestNode *Parsed = nullptr;

				llvm::ArrayRef<const Node *> predecessors() const {
				return llvm::makeArrayRef(reinterpret_cast<const Node *const *>(this + 1),
				PredecessorCount);
				};

				bool operator==(const Node &L) const {
				sammccall: why is Parsed not part of the identity? (maybe there's a good reason, but there should be a comment)
				return State == L.State && predecessors() == L.predecessors();
				}
				// A trailing array of Node*.
				};

				// Creates a new node in the graph.
				const Node *addNode(LRTable::StateID State,
				const ::clang::syntax::pseudo::ForestNode *Symbol,
				llvm::ArrayRef<const Node *> Predecessors) {
				assert(Predecessors.size() < (1 << Node::PredecessorBits) &&
				sammccall: 8 parents isn't that many and it seems like this might be a dynamic property of the input code rather than a static property of the grammar.
				But I don't think this bitpacking is buying anything, it looks like the layout is:
				- State : 13
				- PredecessorCount : 3
				- (padding) : 48
				- Parsed : 64
				So I think we might as well just use a uint16 PredecessorCount and still have room left over for a uint16 refcount later. Is there anything else we might usefully store in the extra space?
				"Too many predecessors to fit in PredecessorBits!");
				++NodeCount;
				Node *Result = new (Arena.Allocate(
				sizeof(Node) + Predecessors.size() * sizeof(Node *), alignof(Node)))
				Node({State, static_cast<unsigned>(Predecessors.size())});
				Result->Parsed = Symbol;
				if (!Predecessors.empty())
				llvm::copy(Predecessors, reinterpret_cast<const Node **>(Result + 1));
				return Result;
				}

				size_t bytes() const { return Arena.getTotalMemory() + sizeof(*this); }
				size_t nodeCount() const { return NodeCount; }
				sammccall: name consistently with forest arena

				private:
				llvm::BumpPtrAllocator Arena;
				unsigned NodeCount = 0;
				};

				class GLRParser {
				public:
				GLRParser(const LRTable &T, const Grammar &G, ForestArena &Arena)
				: ParsingTable(T), G(G), ParsedForest(Arena) {}

				const ForestNode *parse(const TokenStream &Code);
				sammccall: as discussed offline, the signature here should involve a token range, or likely an ArrayRef<ForestNode> for the relevant terminals. Probably a FIXME to allow the start symbol to be specified.

				const Graph &getGSS() const { return GSS; }
				sammccall: this function never exposes an interesting graph, as it's mostly empty before + after parse(). It's only really useful for working out how much memory is allocated.
				If we were to drop this, the public interface of GLRParser would be a single function (so it need not be a public class) and Graph would disappear entirely!
				With this in mind, it seems like we could replace this header file with something like:
				struct ParseStats { unsigned GSSBytes = 0; };
				ForestNode *glrParse(const TokenStream &, const LRTable&, const Grammar&, ForestArena&, ParseStats *Stats = nullptr);
				It feels like hiding the GSS might be an obstacle to debugging/testing. However this really would need a story/API around how to interrupt the parser. LLVM_DEBUG seems reasonable to me for now.
				sammccall: After offline discussion I think we either want:
				- to hide GSS as mentioned above, and just test the forest output
				- expose GSS class, and have a two versions of the parse function: one that finalizes the GSS into a result ForestNode, and a "raw" one that just returns the GSS. Then we can get at the GSS state at a point by running the raw parser on some prefix of the code.
				(Or even only have the raw one and make it the caller's responsibility to extract the node from the GSS, but this is probably silly)

				private:
				// Return a list of active stack heads.
				std::vector<const Graph::Node *> performShift(Token::Index Lookahead);
				void performReduction(const Token &Lookahead);

				void addActions(const Graph::Node *Head, const Token &Lookahead);

				const LRTable &ParsingTable;
				const Grammar &G;

				// An active stack head can have multiple available actions (reduce/reduce
				// actions, reduce/shift actions).
				// Frontier tracks all available actions from all active stack heads.
				struct Frontier {
				sammccall: I love the name frontier! Unfortunately the word refers to the whole boundary/set, rather than individual elements of it. Maybe this?
				/// The list of actions we're ready to perform.
				struct {
				  std::vector<Pending> Shift;
				  std::vector<Pending> Reduce;
				  std::vector<Pending> Accept;
				  bool empty();
				  // accept is an odd-one-out here, but I think it's going away?
				} Frontier;
				Also maybe a FIXME that the frontier needs to order reductions so that local ambiguity packing is guaranteed to work as a single pass
				// A corresponding stack head.
				const Graph::Node *Head = nullptr;
				// An action associated with the Head.
				const LRTable::Action *PerformAction = nullptr;
				sammccall: just Action?
				};
				// A list of active shift actions.
				std::vector<Frontier> ShiftList;
				// A list of active reduce actions.
				std::vector<Frontier> ReduceList;
				// A list of active accept actions.
				std::vector<Frontier> AcceptLists;

				Graph GSS;
				ForestArena &ParsedForest;
				};

				} // namespace pseudo
				} // namespace syntax
				} // namespace clang
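
The review comments on this header describe the graph-structured stack as a DAG in which forks are nodes sharing a parent and joins are nodes with several parents, so one head can stand for more than one linear stack. A minimal standalone sketch of that idea follows. `ToyGSS` and `countStacks` are hypothetical names, symbols are plain characters, and parents are kept in a `std::vector` rather than the trailing array used in the patch.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical simplified GSS: each node records a parse state, the symbol
// that was pushed, and the node(s) below it on the stack.
struct ToyGSS {
  struct Node {
    unsigned State;
    char Symbol;
    std::vector<const Node *> Parents;
  };

  const Node *addNode(unsigned State, char Symbol,
                      std::vector<const Node *> Parents) {
    Nodes.push_back(
        std::make_unique<Node>(Node{State, Symbol, std::move(Parents)}));
    return Nodes.back().get();
  }

  std::vector<std::unique_ptr<Node>> Nodes;
};

// Counts the distinct linear stacks represented by a single head.
inline unsigned countStacks(const ToyGSS::Node *Head) {
  if (Head->Parents.empty())
    return 1;
  unsigned Count = 0;
  for (const ToyGSS::Node *Parent : Head->Parents)
    Count += countStacks(Parent);
  return Count;
}
```

Building a root, one node above it, two forked nodes above that, and a single joined head with both forks as parents gives five nodes and one head, while `countStacks` on that head reports two stacks.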

clang/lib/Tooling/Syntax/Pseudo/CMakeLists.txt

	set(LLVM_LINK_COMPONENTS Support)			set(LLVM_LINK_COMPONENTS Support)

	add_clang_library(clangToolingSyntaxPseudo			add_clang_library(clangToolingSyntaxPseudo
	DirectiveMap.cpp			DirectiveMap.cpp
	Grammar.cpp			Grammar.cpp
	GrammarBNF.cpp			GrammarBNF.cpp
	Lex.cpp			Lex.cpp
	LRGraph.cpp			LRGraph.cpp
	LRTable.cpp			LRTable.cpp
	LRTableBuild.cpp			LRTableBuild.cpp
				Forest.cpp
				GLRParser.cpp
	Token.cpp			Token.cpp

	LINK_LIBS			LINK_LIBS
	clangBasic			clangBasic
	clangLex			clangLex
	)			)

clang/lib/Tooling/Syntax/Pseudo/Forest.cpp

This file was added.

				//===--- Forest.cpp - Parse forest -------------------------------*- C++-*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "clang/Tooling/Syntax/Pseudo/Forest.h"
				#include "clang/Tooling/Syntax/Pseudo/Token.h"
				#include "llvm/ADT/ArrayRef.h"
				#include "llvm/ADT/None.h"
				#include "llvm/ADT/STLExtras.h"
				#include "llvm/Support/ErrorHandling.h"
				#include "llvm/Support/FormatVariadic.h"

				namespace clang {
				namespace syntax {
				namespace pseudo {

				std::string ForestNode::Dump(const Grammar &G) const {
				switch (kind()) {
				case Ambiguous:
				return llvm::formatv("{0} := <ambiguous>", G.symbolName(symbol()));
				case Terminal:
				return llvm::formatv("{0} := tok[{1}]", G.symbolName(symbol()), startLoc());
				case Sequence:
				return G.dumpRule(rule());
				}
				llvm_unreachable("unhandle node kind!");
				sammccall: unhandled
				}

				std::string ForestNode::DumpRecursive(const Grammar &G,
				bool Abbreviated) const {
				// Count visits of nodes so we can mark those seen multiple times.
				llvm::DenseMap<const ForestNode *, unsigned> Visits;
				std::function<void(const ForestNode *)> CountVisits =
				[&](const ForestNode *P) {
				if (Visits[P]++ > 0)
				return; // Don't count children as multiply visited.
				if (P->kind() == Ambiguous)
				llvm::for_each(P->alternatives(), CountVisits);
				else if (P->kind() == Sequence)
				llvm::for_each(P->elements(), CountVisits);
				};
				CountVisits(this);

				llvm::DenseMap<const ForestNode *, unsigned> Ids;
				std::string Result;
				constexpr unsigned kEnd = std::numeric_limits<unsigned>::max();
				std::function<void(const ForestNode *, unsigned, unsigned,
				llvm::Optional<SymbolID>)>
				Dump = [&](const ForestNode *P, unsigned Level, unsigned End,
				llvm::Optional<SymbolID> ElidedParent) {
				// absl::Span<const Node* const> children;
				sammccall: commented out code
				llvm::ArrayRef<const ForestNode *> children;
				auto end_of_element = [&](unsigned child_index) {
				return child_index + 1 == children.size()
				? End
				: children[child_index + 1]->startLoc();
				};
				if (P->kind() == Ambiguous) {
				children = P->alternatives();
				} else if (P->kind() == Sequence) {
				children = P->elements();
				if (Abbreviated) {
				if (P->startLoc() == End)
				return;
				for (unsigned i = 0; i < children.size(); ++i)
				if (children[i]->startLoc() == P->startLoc() &&
				end_of_element(i) == End) {
				return Dump(
				children[i], Level, End,
				/*elided_parent=*/ElidedParent.getValueOr(P->symbol()));
				sammccall: a few of these comments have the wrong case on the names
				}
				}
				}

				// FIXME: pretty ascii trees
				if (End == kEnd)
				Result += llvm::formatv("[{0},end) ", P->startLoc());
				else
				Result += llvm::formatv("[{0},{1}) ", P->startLoc(), End);
				Result.append(2 * Level, ' ');
				if (ElidedParent.hasValue()) {
				Result += G.symbolName(*ElidedParent);
				Result += "~";
				}
				Result.append(P->Dump(G));
				if (Visits.find(P)->getSecond() > 1 &&
				P->kind() != ForestNode::Terminal) {
				// The first time, print as #1. Later, =#1.
				auto id = Ids.try_emplace(P, Ids.size() + 1);
				Result +=
				llvm::formatv(" {0}#{1}", id.second ? "" : "=", id.first->second);
				}
				Result.push_back('\n');

				++Level;
				for (unsigned i = 0; i < children.size(); ++i)
				Dump(children[i], Level,
				P->kind() == Sequence ? end_of_element(i) : End, llvm::None);
				};
				Dump(this, 0, kEnd, llvm::None);
				return Result;
				}

				void ForestArena::init(const TokenStream &Tokens) {
				Arena.Reset(); // clean the arena.
				NodeNum = 0;
				// List of leaves is prepopulated, it's convenient and we need them anyway.
				Terminals = Arena.Allocate<ForestNode>(Tokens.tokens().size());
				size_t Index = 0;
				for (const auto &T : Tokens.tokens()) {
				new (&Terminals[Index])
				ForestNode(ForestNode::Terminal, tokenSymbol(T.Kind),
				/*begin=*/Index, /*TerminalData*/ 0);
				++Index;
				}
				NodeNum = Tokens.tokens().size();
				}

				} // namespace pseudo
				} // namespace syntax
				} // namespace clang

clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp

This file was added.

				//===--- GLRParser.cpp -------------------------------------------*- C++-*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "clang/Tooling/Syntax/Pseudo/GLRParser.h"
				#include "clang/Basic/TokenKinds.h"
				#include "clang/Tooling/Syntax/Pseudo/Grammar.h"
				#include "clang/Tooling/Syntax/Pseudo/LRTable.h"
				#include "clang/Tooling/Syntax/Pseudo/Token.h"
				#include "llvm/ADT/ArrayRef.h"
				#include "llvm/ADT/STLExtras.h"
				#include "llvm/ADT/StringExtras.h"
				#include "llvm/Support/Debug.h"
				#include "llvm/Support/ErrorHandling.h"
				#include "llvm/Support/FormatVariadic.h"
				#include <memory>
				#include <tuple>

				#define DEBUG_TYPE "GLRParser.cpp"

				namespace clang {
				namespace syntax {
				namespace pseudo {

				using StateID = LRTable::StateID;

				const ForestNode *GLRParser::parse(const TokenStream &Code) {
				ParsedForest.init(Code);
				const Token *Lookahead = &Code.tokens().front();
				sammccall: It actually currently looks like it would be at least as natural to have the parser operate on the sequence of terminal ForestNodes rather than on the Tokens themselves...
				sammccall: I'd suggest calling this NextTok instead of Lookahead:
				- in common english this is a verb rather than a noun
				- in parsers it more often refers to the number of tokens in the tables than the tokens themselves
				- referring to the token you're currently shifting as "ahead" is bizarre!
				addActions(GSS.addNode(/*StartState*/ 0, nullptr, {}), *Lookahead);

				while (!ShiftList.empty() || !ReduceList.empty()) {
				LLVM_DEBUG(llvm::dbgs() << llvm::formatv(
				"Lookahead token {0} (id: {1} text: '{2}')\n",
				G.symbolName(tokenSymbol(Lookahead->Kind)),
				tokenSymbol(Lookahead->Kind), Lookahead->text()));

				performReduction(*Lookahead);
				auto NewHeads = performShift(Code.index(*Lookahead));

				if (Lookahead->Kind != tok::eof)
				++Lookahead;
				for (const auto &AS : NewHeads)
				addActions(AS, *Lookahead);
				}

				if (!AcceptLists.empty()) {
				// FIXME: supporting multiple accepted symbols. It should be fine now, as we
				// only have one production for the start symbol `_`. This would become a
				// problem when we support parsing any code snippet rather than the
				// translation unit.
				assert(AcceptLists.size() == 1);
				LLVM_DEBUG(llvm::dbgs() << llvm::formatv("Accept: {0} accepted results:\n",
				AcceptLists.size()));
				for (const auto &A : AcceptLists)
				LLVM_DEBUG(llvm::dbgs()
				<< " - " << G.symbolName(A.Head->Parsed->symbol()) << "\n");
				return AcceptLists.front().Head->Parsed;
				}
				return nullptr;
				sammccall: Returning null when the input doesn't match the grammar is not at all what we want.
				I know it's early, but I'm worried we're defining an API for the GLR parser that matches the textbook rather than the one we need. (Similar to concerns about the Accept action - we're going to accept any stream of tokens).
				I think we need to plan a little bit how error recovery is going to work here:
				- do we plan to call this recursively inside brackets, or handle lazy-brackets in the GLR parser itself?
				- will sequence recovery (after `,` or between declarations) happen in the GLR parser or externally? in this top loop? how will parsing continue?
				- where would splitting heuristics happen (like interpreting `x? = y?` as an assignment expression, or `foo(; bar;` as two statements)
				- in the worst case, when we're parsing something with no sequence recovery (say an expression) and we don't match the grammar, no splitting rule applies etc... are we going to create an opaque node? Does this function do it or the caller?
				I suspect it would be a better fit to:
				a) have an Opaque node, if not in this patch then very soon
				b) operate on an ArrayRef<Token> or ArrayRef<ForestNode>, and stop relying on eof to delimit (so we can parse subranges if needed)
				c) stop relying on Accept actions, since we'll want to look at the stack when we run out of tokens in general, and Accept just models a trivial case of that
				}

				std::vector<const Graph::Node *>
				sammccall: if this is a hot loop we shouldn't be creating std::vectors and throwing them away
				GLRParser::performShift(Token::Index Lookahead) {
				assert(ReduceList.empty() &&
				"Reduce actions must be performed before shift actions");
				if (ShiftList.empty())
				return {};
				LLVM_DEBUG(llvm::dbgs() << llvm::formatv(
				" Perform Shift ({0} active heads):\n", ShiftList.size()));

				const pseudo::ForestNode *Leaf = &ParsedForest.terminal(Lookahead);
				// New heads after performing all the shifts.
				std::vector<const Graph::Node *> NewHeads;

				// Merge the stack -- if multiple stack heads are going to shift a same
				sammccall: we shift tokens, not states: the token conceptually moves from the input to the stack. (This seems to be other references not just me!)
				if multiple stack heads will reach the same state after shifting a token?
				// state, we perform the shift only once by combining these heads.
				//
				// E.g. we have two heads (2, 3) in the GSS, and state 4 is to be shifted from
				// state 2 and state 3:
				// 0 -> 1 -> 2
				sammccall: I think the arrows directions aren't particularly clear (our pointers go the opposite way) or necessary here, and backticks and up arrows are hard to read. A couple of (widely-supported) box-drawing characters would help a lot I think:
				0--1--2
				   └--3

				0--1--2--4
				   └--3--┘
				(I would keep using `-` and `|` rather than making everything pretty boxes, it's readable enough and easier to edit. We may want to consider box-drawing characters for dump functions though...)
				//      ` -> 3
				// After the shift action, the GSS looks like below, state 4 becomes the new
				// head:
				// 0 -> 1 -> 2 -> 4
				//      ` -> 3 ---^
				//
				// Shifts are partitioned by the shift state, so each partition (per loop
				// iteration) corresponds to a "perform" shift.
				sammccall: I can't parse `a "perform" shift`.
				Batch shifts by target state so we can merge matching groups?
				llvm::sort(ShiftList, [](const Frontier &L, const Frontier &R) {
				assert(L.PerformAction->kind() == LRTable::Action::Shift &&
				R.PerformAction->kind() == LRTable::Action::Shift);
				return std::forward_as_tuple(L.PerformAction->getShiftState(), L.Head) <
				sammccall: I don't think tiebreaking by `Head` achieves anything.
				If we really need to ensure deterministic behavior, comparing pointers won't do that as allocation may vary. On the other hand, `stable_sort` should work: we'll tiebreak by order enqueued, so we're deterministic if our inputs are.
				std::forward_as_tuple(R.PerformAction->getShiftState(), R.Head);
				});
				auto Partition = llvm::makeArrayRef(ShiftList);
				while (!Partition.empty()) {
				StateID NextState = Partition.front().PerformAction->getShiftState();
				auto Batch = Partition.take_while([&NextState](const Frontier &A) {
				return A.PerformAction->getShiftState() == NextState;
				});
				assert(!Batch.empty());
				// Predecessors of the new head in GSS.
				std::vector<const Graph::Node *> Predecessors;
				sammccall: SmallVector
				llvm::for_each(Batch, [&Predecessors](const Frontier &F) {
				assert(llvm::find(Predecessors, F.Head) == Predecessors.end() &&
				sammccall: llvm::is_contained
				"Unexpected duplicated stack heads during shift!");
				Predecessors.push_back(F.Head);
				});
				const auto *Head = GSS.addNode(NextState, Leaf, Predecessors);
				sammccall: if this turns out to be hot, you could avoid the temporary array of predecessors: return a mutable node from addNode
				LLVM_DEBUG(llvm::dbgs()
				<< llvm::formatv(" - state {0} -> state {1}\n",
				Partition.front().Head->State, NextState));

				NewHeads.push_back(Head);
				// Next iteration for next partition.
				Partition = Partition.drop_front(Batch.size());
				}
				ShiftList.clear();
				return NewHeads;
				}
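
`performShift` sorts the pending shifts by target state and creates one new head per group, with every head in the group as a parent. A minimal standalone sketch of that batching step follows. `PendingShift`, `NewHead`, and `batchShifts` are hypothetical names, heads are plain integers rather than GSS nodes, and `std::stable_sort` is used as suggested in the review so that ties keep their enqueue order.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical pending shift: the head it applies to and the state that
// head reaches after shifting the next token.
struct PendingShift {
  int Head;
  unsigned NextState;
};

// Hypothetical result: one new head per target state, with all of the
// heads that shift into that state as its parents.
struct NewHead {
  unsigned State;
  std::vector<int> Parents;
};

std::vector<NewHead> batchShifts(std::vector<PendingShift> Shifts) {
  // Group by target state; stable_sort keeps enqueue order within a group.
  std::stable_sort(Shifts.begin(), Shifts.end(),
                   [](const PendingShift &L, const PendingShift &R) {
                     return L.NextState < R.NextState;
                   });
  std::vector<NewHead> Heads;
  for (const PendingShift &S : Shifts) {
    // Start a new head whenever the target state changes.
    if (Heads.empty() || Heads.back().State != S.NextState)
      Heads.push_back({S.NextState, {}});
    Heads.back().Parents.push_back(S.Head);
  }
  return Heads;
}
```

With heads 2 and 3 both shifting to state 4, the result contains a single head in state 4 whose parents are 2 and 3, which is the merged shape shown in the comment inside `performShift`.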

				static std::vector<std::string>
				getStateString(llvm::ArrayRef<const Graph::Node *> A) {
				sammccall: name doesn't match return type
				std::vector<std::string> States;
				for (const auto &N : A)
				States.push_back(llvm::formatv("state {0}", N->State));
				sammccall: I suspect `S{0}` will be verbose enough, it's not likely spelling out `state` will make this much less cryptic, but it may be harder to scan
				return States;
				}

				// Enumerate all reduce paths on the stack by traversing from the given Head in
				// the GSS.
				static void enumerateReducePath(const Graph::Node *Head, unsigned PathLength,
				std::vector<const Graph::Node *> &PathStorage,
				std::function<void()> CB) {
				assert(PathStorage.empty() && "PathStorage must be empty!");
				std::function<void(const Graph::Node *, unsigned)> enumPath =
				sammccall: We're leaning on std::function a lot for code that's supposedly in the hot path. It's probably fine, but I'd like to see this written more directly.
				As far as I can tell, this could easily be a class, with enumerateReducePath a (recursive) method that concretely calls into handleReducePath() or something at the bottom.
				Also if it's a class we can easily reset() instead of destroying it if we want to reuse its vectors across separate reductions.
				[&CB, &PathStorage, &enumPath](const Graph::Node *Current,
				unsigned Length) -> void {
				assert(Length > 0);
				PathStorage.push_back(Current);
				if (--Length == 0)
				return CB();

				for (const auto *Next : Current->predecessors())
				enumPath(Next, Length);
				PathStorage.pop_back();
				};
				enumPath(Head, PathLength);
				}
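
The review suggests writing the path enumeration as a class with a recursive method instead of nested `std::function` objects. A minimal standalone sketch of that shape follows. `PathNode` and `ReducePathEnumerator` are hypothetical names, the per-path handler simply records the states on the path, and the scratch path is restored on every return so each reported path contains exactly the requested number of nodes.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for a GSS node: a state plus parent pointers.
struct PathNode {
  unsigned State;
  std::vector<const PathNode *> Parents;
};

class ReducePathEnumerator {
public:
  // Collects the state sequence of every path of Length nodes that starts
  // at Head and walks towards the bottom of the stack.
  const std::vector<std::vector<unsigned>> &run(const PathNode *Head,
                                                unsigned Length) {
    Path.clear();
    Found.clear();
    if (Length > 0)
      walk(Head, Length);
    return Found;
  }

private:
  void walk(const PathNode *Current, unsigned Remaining) {
    Path.push_back(Current);
    if (Remaining == 1)
      handlePath();
    else
      for (const PathNode *Next : Current->Parents)
        walk(Next, Remaining - 1);
    Path.pop_back(); // Restore the scratch path on every return.
  }

  // Called once per complete path; here it only records the states.
  void handlePath() {
    std::vector<unsigned> States;
    for (const PathNode *N : Path)
      States.push_back(N->State);
    Found.push_back(std::move(States));
  }

  std::vector<const PathNode *> Path;
  std::vector<std::vector<unsigned>> Found;
};
```

Keeping the vectors as members means one enumerator can be reused across reductions without reallocating, which is the reuse the review comment has in mind.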

				// Perform reduction recursively until we don't have reduce actions with
				// heads.
				void GLRParser::performReduction(const Token &Lookahead) {
				sammccall: (just a note, I need to dig into this part tomorrow!)
				if (!ReduceList.empty())
				LLVM_DEBUG(llvm::dbgs() << " Performing Reduce\n");

				// Reduce can manipulate the GSS in following way:
				//
				// 1) Split --
				// 1.1 when a stack head has multiple reduce actions, the head is
				// made to split to accommodate the various possibilities.
				// E.g.
				// 0 -> 1 (ID)
				// After performing reduce of production rules (class-name := ID,
				// enum-name := ID), the GSS now has two new heads:
				// 0 -> 2 (class-name)
				// `-> 3 (enum-name)
				//
				// 1.2 when a stack head has a reduce action with multiple reduce
				// paths, the head is to split.
				// E.g.
				// ... -> 1(...) -> 3 (INT)
				//                  ^
				// ... -> 2(...) ---|
				//
				// After the reduce action (simple-type-specifier := INT), the GSS looks
				// like:
				// ... -> 1(...) -> 4 (simple-type-specifier)
				// ... -> 2(...) -> 5 (simple-type-specifier)
				//
				// 2) Merge -- if multiple heads turn out to be identical after
				// reduction (new heads have the same state, and point to the same
				// predecessors), these heads are merged and treated as a single head.
				// This is usually where ambiguity happens.
				//
				// E.g.
				// 0 -> 2 (class-name)
				// ` -> 3 (enum-name)
				// After reduction of rules (type-name := class-name | enum-name), the GSS
				// has the following form:
				// 0 -> 4 (type-name)
				// The type-name forest node in the new head 4 is ambiguous, which has two
				// parses (type-name -> class-name -> id, type-name -> enum-name -> id).

				// Store all newly-created stack heads for tracking ambiguities.
				std::vector<const Graph::Node *> CreatedHeads;
				while (!ReduceList.empty()) {
				auto RA = std::move(ReduceList.back());
				ReduceList.pop_back();

				RuleID ReduceRuleID = RA.PerformAction->getReduceRule();
				const Rule &ReduceRule = G.lookupRule(ReduceRuleID);
				LLVM_DEBUG(llvm::dbgs() << llvm::formatv(
				" !reduce rule {0}: {1} head: {2}\n", ReduceRuleID,
				G.dumpRule(ReduceRuleID), RA.Head->State));

				std::vector<const Graph::Node *> ReducePath;
				enumerateReducePath(RA.Head, ReduceRule.Size, ReducePath, [&]() {
				LLVM_DEBUG(
				llvm::dbgs() << llvm::formatv(
				" stack path: {0}, bases: {1}\n",
				llvm::join(getStateString(ReducePath), " -> "),
				llvm::join(getStateString(ReducePath.back()->predecessors()),
				", ")));

				// A reduce is a back-and-forth operation in the stack.
				// For example, we reduce a rule "declaration := decl-specifier-seq ;" on
				// the linear stack:
				//
				//   0 -> 1(decl-specifier-seq) -> 3(;)
				//   ^ Base                        ^ Head
				//        <--- ReducePath: [3,1] ---->
				//
				// 1. back -- pop |ReduceRuleLength| nodes (ReducePath) in the stack;
				// 2. forth -- push a new node in the stack and mark it as a head;
				//   0 -> 4(declaration)
				//        ^ Head
				//
				// It becomes tricky if a reduce path has multiple bases: we want to merge
				// them if their next state is the same. Similar to performShift above,
				// we partition the bases by their next state, and process each partition
				// per loop iteration.
				struct BaseInfo {
				// An intermediate head after the stack has popped |ReducePath| nodes.
				const Graph::Node *Base = nullptr;
				// The final state after reduce.
				// It is getGoToState(Base->State, ReduceSymbol).
				StateID NextState;
				};
				std::vector<BaseInfo> Bases;
				for (const Graph::Node *Base : ReducePath.back()->predecessors())
				Bases.push_back(
				{Base, ParsingTable.getGoToState(Base->State, ReduceRule.Target)});
				llvm::sort(Bases, [](const BaseInfo &L, const BaseInfo &R) {
				return std::forward_as_tuple(L.NextState, L.Base) <
				std::forward_as_tuple(R.NextState, R.Base);
				});

				llvm::ArrayRef<BaseInfo> Partition = llvm::makeArrayRef(Bases);
				while (!Partition.empty()) {
				StateID NextState = Partition.front().NextState;
				// Predecessors of the new stack head.
				std::vector<const Graph::Node *> Predecessors;
				auto Batch = Partition.take_while([&](const BaseInfo &TB) {
				if (NextState != TB.NextState)
				return false;
				Predecessors.push_back(TB.Base);
				return true;
				});
				assert(!Batch.empty());
				Partition = Partition.drop_front(Batch.size());

				// Check ambiguities.
				auto It = llvm::find_if(CreatedHeads, [&](const Graph::Node *Head) {
				return Head->Parsed->symbol() == ReduceRule.Target &&
				Head->predecessors() == llvm::makeArrayRef(Predecessors);
				});
				if (It != CreatedHeads.end()) {
				// This should be guaranteed by checking the equivalence of
				// predecessors and reduce nonterminal symbol!
				assert(NextState == (*It)->State);
				LLVM_DEBUG(llvm::dbgs() << llvm::formatv(
				" found ambiguity, merged in state {0} (forest "
				"'{1}')\n",
				(*It)->State, G.symbolName((*It)->Parsed->symbol())));
				// FIXME: create ambiguous forest node!
				continue;
				}

				// Create a corresponding sequence forest node for the reduce rule.
				std::vector<const ForestNode *> ForestChildren;
				for (const Graph::Node *PN : llvm::reverse(ReducePath))
				ForestChildren.push_back(PN->Parsed);
				const ForestNode &ForestNode = ParsedForest.createSequence(
				ReduceRule.Target, RA.PerformAction->getReduceRule(),
				ForestChildren.front()->startLoc(), ForestChildren);
				LLVM_DEBUG(llvm::dbgs() << llvm::formatv(
				" after reduce: {0} -> state {1} ({2})\n",
				llvm::join(getStateString(Predecessors), ", "),
				NextState, G.symbolName(ReduceRule.Target)));

				// Create a new stack head.
				const Graph::Node *Head =
				GSS.addNode(NextState, &ForestNode, Predecessors);
				CreatedHeads.push_back(Head);

				// Actions that are enabled by this reduce.
				addActions(Head, Lookahead);
				}
				});
				}
				}
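The "partition the bases by their next state" step in performReduction can be illustrated in isolation. The sketch below is an assumption-laden simplification rather than the patch's code: it uses plain integers in place of `Graph::Node` pointers and `StateID`, uses `std::sort` and a single grouping pass instead of `llvm::sort` and `ArrayRef::take_while`, and `partitionByNextState` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cassert>
#include <tuple>
#include <utility>
#include <vector>

// Simplified stand-in for the BaseInfo struct in the patch.
struct BaseInfo {
  int BaseID;    // Stands in for the base node pointer.
  int NextState; // Stands in for getGoToState(Base->State, ReduceSymbol).
};

// Groups bases that share the same next state. Each group would become the
// predecessor list of one new stack head.
std::vector<std::pair<int, std::vector<int>>>
partitionByNextState(std::vector<BaseInfo> Bases) {
  // Sorting by (NextState, BaseID) makes equal next states adjacent and
  // gives every group a deterministic predecessor order.
  std::sort(Bases.begin(), Bases.end(),
            [](const BaseInfo &L, const BaseInfo &R) {
              return std::tie(L.NextState, L.BaseID) <
                     std::tie(R.NextState, R.BaseID);
            });

  std::vector<std::pair<int, std::vector<int>>> Groups;
  for (const BaseInfo &B : Bases) {
    // Start a new group whenever the next state changes.
    if (Groups.empty() || Groups.back().first != B.NextState)
      Groups.emplace_back(B.NextState, std::vector<int>{});
    Groups.back().second.push_back(B.BaseID);
  }
  return Groups;
}
```

For bases 10 and 12 going to state 7 and bases 9 and 11 going to state 5, this yields two groups: state 5 with predecessors [9, 11] and state 7 with predecessors [10, 12]. Sorting on the base as well as the state matters for the later merge check, which compares predecessor lists element by element.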

				void GLRParser::addActions(const Graph::Node *Head, const Token &Lookahead) {
				for (const auto &Action :
				ParsingTable.getActions(Head->State, tokenSymbol(Lookahead.Kind))) {
				switch (Action.kind()) {
				case LRTable::Action::Shift:
				ShiftList.push_back({Head, &Action});
				break;
				case LRTable::Action::Reduce:
				ReduceList.push_back({Head, &Action});
				break;
				case LRTable::Action::Accept:
				AcceptLists.push_back({Head, &Action});
				break;
				default:
				llvm_unreachable("unexpected action kind!");
				}
				}
				}

				} // namespace pseudo
				} // namespace syntax
				} // namespace clang
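The merge check described in the performReduction comments (a new head is merged into an existing one when the reduced symbol and the predecessors match) can be sketched on its own. Everything below is hypothetical scaffolding: `Head`, `addOrMerge` and the `Alternatives` counter are illustrative names, integers replace node pointers and symbol IDs, and counting alternatives is only a placeholder for the ambiguous forest node that the patch leaves as a FIXME.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Simplified stand-in for a newly created stack head.
struct Head {
  int Symbol;             // Nonterminal produced by the reduction.
  int State;              // LR state of the head.
  std::vector<int> Preds; // IDs of the predecessor nodes.
  int Alternatives = 1;   // Number of parses merged into this head.
};

// Adds a new head, or merges into an existing head with the same symbol and
// the same predecessors. Returns true if a merge happened.
bool addOrMerge(std::vector<Head> &Created, int Symbol, int State,
                std::vector<int> Preds) {
  auto It = std::find_if(Created.begin(), Created.end(), [&](const Head &H) {
    return H.Symbol == Symbol && H.Preds == Preds;
  });
  if (It != Created.end()) {
    // Same predecessors and same symbol imply the same goto state.
    assert(It->State == State);
    ++It->Alternatives;
    return true;
  }
  Created.push_back(Head{Symbol, State, std::move(Preds), 1});
  return false;
}
```

In the type-name example from the comments, reducing class-name and enum-name from base 0 calls this twice with the same symbol and predecessor list, so the second call merges and leaves a single head with two alternatives.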

clang/tools/clang-pseudo/ClangPseudo.cpp

//===-- ClangPseudo.cpp - Clang pseudo parser tool ------------------------===//		//===-- ClangPseudo.cpp - Clang pseudo parser tool ------------------------===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "clang/Basic/LangOptions.h"		#include "clang/Basic/LangOptions.h"
#include "clang/Tooling/Syntax/Pseudo/DirectiveMap.h"		#include "clang/Tooling/Syntax/Pseudo/DirectiveMap.h"
		#include "clang/Tooling/Syntax/Pseudo/GLRParser.h"
#include "clang/Tooling/Syntax/Pseudo/Grammar.h"		#include "clang/Tooling/Syntax/Pseudo/Grammar.h"
#include "clang/Tooling/Syntax/Pseudo/LRGraph.h"		#include "clang/Tooling/Syntax/Pseudo/LRGraph.h"
#include "clang/Tooling/Syntax/Pseudo/LRTable.h"		#include "clang/Tooling/Syntax/Pseudo/LRTable.h"
#include "clang/Tooling/Syntax/Pseudo/Token.h"		#include "clang/Tooling/Syntax/Pseudo/Token.h"
#include "llvm/ADT/StringExtras.h"		#include "llvm/ADT/StringExtras.h"
#include "llvm/Support/CommandLine.h"		#include "llvm/Support/CommandLine.h"
#include "llvm/Support/FormatVariadic.h"		#include "llvm/Support/FormatVariadic.h"
#include "llvm/Support/MemoryBuffer.h"		#include "llvm/Support/MemoryBuffer.h"

using clang::syntax::pseudo::Grammar;		using clang::syntax::pseudo::Grammar;
using llvm::cl::desc;		using llvm::cl::desc;
using llvm::cl::init;		using llvm::cl::init;
using llvm::cl::opt;		using llvm::cl::opt;

static opt<std::string>		static opt<std::string>
Grammar("grammar", desc("Parse and check a BNF grammar file."), init(""));		Grammar("grammar", desc("Parse and check a BNF grammar file."), init(""));
static opt<bool> PrintGrammar("print-grammar", desc("Print the grammar."));		static opt<bool> PrintGrammar("print-grammar", desc("Print the grammar."));
static opt<bool> PrintGraph("print-graph",		static opt<bool> PrintGraph("print-graph",
desc("Print the LR graph for the grammar"));		desc("Print the LR graph for the grammar"));
static opt<bool> PrintTable("print-table",		static opt<bool> PrintTable("print-table",
desc("Print the LR table for the grammar"));		desc("Print the LR table for the grammar"));
		static opt<std::string> ParseFile("parse", desc("Parse a C++ source file"),
		sammccall: nit: a C++ source file => a source file
		sammccall: Why is this a new flag independent of `Source`? It also seems inconsistent with other flags, which describe what output is requested rather than what computation should be done. Maybe "-print-forest" and "-print-stats"? (which could also potentially control stats of other stages)
		init(""));
static opt<std::string> Source("source", desc("Source file"));		static opt<std::string> Source("source", desc("Source file"));
static opt<bool> PrintSource("print-source", desc("Print token stream"));		static opt<bool> PrintSource("print-source", desc("Print token stream"));
static opt<bool> PrintTokens("print-tokens", desc("Print detailed token info"));		static opt<bool> PrintTokens("print-tokens", desc("Print detailed token info"));
static opt<bool>		static opt<bool>
PrintDirectiveMap("print-directive-map",		PrintDirectiveMap("print-directive-map",
desc("Print directive structure of source code"));		desc("Print directive structure of source code"));

static std::string readOrDie(llvm::StringRef Path) {		static std::string readOrDie(llvm::StringRef Path) {
Show All 24 Lines	if (Grammar.getNumOccurrences()) {
if (PrintGrammar)		if (PrintGrammar)
llvm::outs() << G->dump();		llvm::outs() << G->dump();
if (PrintGraph)		if (PrintGraph)
llvm::outs() << clang::syntax::pseudo::LRGraph::buildLR0(*G).dumpForTests(		llvm::outs() << clang::syntax::pseudo::LRGraph::buildLR0(*G).dumpForTests(
*G);		*G);
if (PrintTable)		if (PrintTable)
llvm::outs() << clang::syntax::pseudo::LRTable::buildSLR(*G).dumpForTests(		llvm::outs() << clang::syntax::pseudo::LRTable::buildSLR(*G).dumpForTests(
*G);		*G);
		if (ParseFile.getNumOccurrences()) {
		std::string Code = readOrDie(ParseFile);
		const auto &T = clang::syntax::pseudo::LRTable::buildSLR(*G);
		clang::LangOptions Opts;
		Opts.CPlusPlus = 1;

		auto RawTokens = clang::syntax::pseudo::lex(Code, Opts);
		auto Tokens = clang::syntax::pseudo::stripComments(cook(RawTokens, Opts));
		clang::syntax::pseudo::ForestArena Arena;
		clang::syntax::pseudo::GLRParser Parser(T, *G, Arena);
		const auto *Root = Parser.parse(Tokens);
		if (Root) {
		llvm::outs() << "parsed successfully!\n";
		llvm::outs() << "Forest bytes: " << Arena.bytes()
		<< " nodes: " << Arena.nodeNum() << "\n";
		llvm::outs() << "GSS bytes: " << Parser.getGSS().bytes()
		<< " nodes: " << Parser.getGSS().nodeCount() << "\n";
		// llvm::outs() << Root->DumpRecursive(*G, true);
		}
		}
return 0;		return 0;
}		}

if (Source.getNumOccurrences()) {		if (Source.getNumOccurrences()) {
std::string Text = readOrDie(Source);		std::string Text = readOrDie(Source);
clang::LangOptions LangOpts; // FIXME: use real options.		clang::LangOptions LangOpts; // FIXME: use real options.
auto Stream = clang::syntax::pseudo::lex(Text, LangOpts);		auto Stream = clang::syntax::pseudo::lex(Text, LangOpts);
auto Structure = clang::syntax::pseudo::DirectiveMap::parse(Stream);		auto Structure = clang::syntax::pseudo::DirectiveMap::parse(Stream);
Show All 11 Lines