This is an archive of the discontinued LLVM Phabricator instance.

Differential D121368

[pseudo][WIP] Build Ambiguous forest node in the GLR Parser.
Needs Review · Public

Authored by hokein on Mar 10 2022, 3:07 AM.

Details

Reviewers

sammccall

Summary

Forest nodes are immutable by design. To create an ambiguous node, we have
to know all alternatives in advance.

In order to achieve that, we must perform all reductions in a careful
order (see code comments for details), so that we can gather completed
alternatives as a batch, and process them in a single pass.

E.g. consider the following grammar:

TU := stmt
TU := expr
stmt := expr
expr := ID
stmt := ID

// Ambiguous stmt forest node:
//     stmt (ambiguous)
//    /     \
//   /      stmt
//   |       |
//  stmt   expr
//    \     /
//       ID

The ambiguous stmt node is built in a single step in which we perform three reductions:

1. expr := ID
2. stmt := ID
3. stmt := expr (enabled after 1 is performed)

We expect to perform them in batches, in the order {1}, {2, 3}.
When processing the batch {2, 3}, where the ambiguity happens, we build an
ambiguous node for stmt.
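
The batching order described in the summary can be sketched in a few lines of C++. This is a simplified illustration and not the code in this patch: it only handles rules with a single right-hand-side symbol over one token, and the names `Symbol`, `Rule`, `Pending`, `LaterFirst` and `reduceInBatches` are hypothetical. It also assumes that symbol IDs are numbered so that a smaller ID must be reduced first.

```cpp
#include <queue>
#include <vector>

// Hypothetical numbering: a smaller value is reduced earlier.
enum Symbol : unsigned { ID = 0, Expr = 1, Stmt = 2 };

// Only rules of the form Target := RHS (one symbol) are modelled.
struct Rule {
  Symbol Target;
  Symbol RHS;
};

// A reduction waiting in the queue.
struct Pending {
  unsigned RuleIndex;
  Symbol Target;
};

// Comparator for std::priority_queue: returns true if L is processed after R.
struct LaterFirst {
  bool operator()(const Pending &L, const Pending &R) const {
    if (L.Target != R.Target)
      return L.Target > R.Target;
    return L.RuleIndex > R.RuleIndex;
  }
};

// Returns the rule indices grouped into batches. Every batch reduces to one
// target symbol, so all alternatives for that symbol are visible at once.
std::vector<std::vector<unsigned>>
reduceInBatches(const std::vector<Rule> &Rules, Symbol Shifted) {
  std::priority_queue<Pending, std::vector<Pending>, LaterFirst> Queue;
  // Queue every rule whose right-hand side has just become available.
  auto Enable = [&](Symbol Available) {
    for (unsigned I = 0; I < Rules.size(); ++I)
      if (Rules[I].RHS == Available)
        Queue.push({I, Rules[I].Target});
  };
  Enable(Shifted);

  std::vector<std::vector<unsigned>> Batches;
  while (!Queue.empty()) {
    Symbol Target = Queue.top().Target;
    std::vector<unsigned> Batch;
    while (!Queue.empty() && Queue.top().Target == Target) {
      Batch.push_back(Queue.top().RuleIndex);
      Queue.pop();
    }
    Batches.push_back(Batch);
    // Completing Target may enable further reductions (e.g. stmt := expr).
    Enable(Target);
  }
  return Batches;
}
```

With the three rules from the summary numbered from 0 (`{Expr, ID}`, `{Stmt, ID}`, `{Stmt, Expr}`), the sketch yields the batches {0} and {1, 2}, which corresponds to the order {1}, {2, 3} above.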

Based on https://reviews.llvm.org/D121150

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

hokein created this revision. Mar 10 2022, 3:07 AM

Herald added a project: Restricted Project. Mar 10 2022, 3:07 AM

hokein requested review of this revision. Mar 10 2022, 3:07 AM

Herald added a subscriber: alextsao1999.

Harbormaster completed remote builds in B153525: Diff 414327. Mar 10 2022, 3:07 AM

The implementation (performReduction) is awkward at the moment, but it is complete and should give a global view of the algorithm.

alextsao1999 added inline comments. Mar 14 2022, 11:18 AM

clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp
318	Maybe we can make goto more clear? like `performGoto` after every GLR reduction.

hokein mentioned this in D122303: [pseudo] Sort nonterminals based on their reduction order. Mar 23 2022, 3:57 AM

A bunch of comments, but probably the most important one is that the inner loop feels uncomfortably chaotic:

there's a huge function with a lot of stuff in scope and referenced, it's unclear what the data flow is
there are very many (hopefully, unnecessarily many) concepts and data structures
it's not really clear what the motivation for the concepts is; comments mostly describe how the algorithm uses them

I have a suspicion that all the allocation and indirection limits performance too, but at the moment that's a secondary concern.

I realize the minimal complexity is probably unavoidably high here but it'd be good if we can sketch out the data structures from first principles and try to find a way to simplify them.

clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h
161	As discussed offline, the algorithm is simplest if the RuleIDs themselves reflect the topological order in which we want to reduce them. We require the RuleIDs to be grouped by the SymbolID of their LHS, but I think we don't have much of a requirement on SymbolIDs themselves. (Currently they're alphabetical and we binary-search for `_` in the grammar constructor, but not much else). So I think we can just sort the symbols with this topo order when constructing the grammar. (This is fairly self-contained, probably affects dumps from tests, could be a separate patch if you like)
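
As a rough illustration of the suggestion on line 161, a reduction order for nonterminals can be derived with a depth-first topological sort over rules of the form A := B. This is a sketch under assumptions, not the grammar-construction code: symbols are plain indices, only single-symbol rules are considered, and `UnitRule` and `reductionOrder` are hypothetical names. The position of a symbol in the result could then serve as its new ID.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Each pair (A, B) stands for a rule A := B. Symbols are plain indices here.
using UnitRule = std::pair<unsigned, unsigned>;

// Returns all symbols in an order where B comes before A whenever A := B
// exists, directly or transitively.
std::vector<unsigned> reductionOrder(unsigned NumSymbols,
                                     const std::vector<UnitRule> &Rules) {
  std::vector<std::vector<unsigned>> DependsOn(NumSymbols);
  for (const UnitRule &R : Rules)
    DependsOn[R.first].push_back(R.second);

  enum VisitState { Unvisited, Visiting, Done };
  std::vector<VisitState> States(NumSymbols, Unvisited);
  std::vector<unsigned> Order;
  // Post-order DFS: a symbol is emitted after everything it depends on.
  std::function<void(unsigned)> Visit = [&](unsigned S) {
    if (States[S] == Done)
      return;
    assert(States[S] != Visiting && "cyclic rules: not a valid grammar");
    States[S] = Visiting;
    for (unsigned Dep : DependsOn[S])
      Visit(Dep);
    States[S] = Done;
    Order.push_back(S);
  };
  for (unsigned S = 0; S < NumSymbols; ++S)
    Visit(S);
  return Order;
}
```

For the grammar in the summary, with TU = 0, expr = 1 and stmt = 2, the rules TU := stmt, TU := expr and stmt := expr give the order expr, stmt, TU. Because every symbol gets a distinct position, two unrelated symbols can never tie.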
clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp
165	... can be reduced using multiple rules (avoid relying on subtleties of what the identity of a "reduce action" is)
174	base hasn't been defined. In general, you've only mentioned the special cases here, but not the basic case. (case 0?)
174	this is confusing because below you define "ReduceAction" to be specific to a base. I'd say "when a stack head can be reduced by a rule, but there are multiple possible bases for the reduction (due to a previous merge), ..."
203	This comment seems a bit confusing: it suggests that all reductions from a state are naturally "the same reduction" unless we're explicitly splitting them out for different paths here. However this doesn't really match the grammatical meaning of the word "reduction", it's the act of reducing some symbols using a rule. If the symbols or the rule change, it's a different reduction. Maybe "A reduction is characterized by the symbols on the stack being reduced and the rule used to transform them. There may be multiple reductions possible for the same rule, if some node near the top of the stack has multiple parents."
205	SmallVector. Size is at max RHS of rule
208	If we can use ruleid instead of grammar table, the comparator becomes stateless and this code is way easier to (re)organize :-)
211	nit: usually early-exit, e.g. `if (LBegin != RBegin) return LBegin < RBegin` as well as reducing nesting it puts the higher-priority rules first
214	This isn't sufficient given your documented definition of topological order: if A and B are incomparable (neither A:=B nor B:=A) then both A and B could have topological order 5, and we could sort ReduceActions as ABAB and fail to pack the ambiguity. We either need SymbolID as a second tiebreak or to bake the topological order into the symbolID/ruleid itself.
220	this is not equality of reduce actions, it's something else. Maybe just SameRangeAndSymbol?
227	unmutable --> immutable
227	nit: I think this comment belongs above the definition of order rather than below it.
227	We haven't really explained why order is important to finding all alternatives. The explanation here is useful, but still too far on the side of describing the code IMO.
	As we reduce, new reductions may become available. Initially, each stack head holds the terminal we just shifted.
	  -- expr -- + -- IDENT
	We can only reduce by rules whose RHS end with this token. After reducing by expr := IDENT we have an `expr` on the stack:
	  -- expr -- + -- expr
	We can now reduce by expr := expr + expr.
	If we reduce in arbitrary order until exhausted, we'd see all possible reductions, but "related" reductions may not be seen at the same time. Reductions that interpret the same token range as the same nonterminal should share a single Ambiguous forest node. If they reach the same parse state, they share a GSS node too. These nodes are immutable and so related reductions need to happen in a batch.
	So instead, we perform reductions in a careful order that ensures all related reductions are visible at once. A reduction of N tokens as symbol S can depend on:
	  1) reducing the last M < N tokens as T, then S := ... T
	  2) reducing the last N tokens as symbol T, then S := T
	To handle 1), shorter reductions happen before longer ones. To handle 2), we use the fact that S := T and T := S can't both be possible (even transitively) in a valid grammar. A topological order of the target symbols is used to ensure that T is reduced before S for fixed N (in our example).
248	nit: capital AddToOrdered...
248	ReduceQueue?
254	I think as previously discussed, I'd be happier if enumerateReducePath was an instance method and just directly wrote into OrderedReduceList (which can be a member), rather than passing around callbacks. This seems like a tight loop to be using std::function
258	if OrderedReduceList was a member instead of a local here, we wouldn't need to move data from ReduceList into OrderedReduceList, we could just put it there in the first place
269	the overall function is too long, this next chunk "process a batch of reductions that produce the same symbol over the same token range" seems like a reasonable place to pull out a function. We should introduce a name for this concept: maybe a reduction family?
273	I don't really like the complexity of the transient data structures we're building here: many hashtables and vectors for each reduction.
275	I'm not really sure precisely what "ambiguities", "equality" and "predecessors" mean in this comment. Can we be more specific?
280	this doesn't make sense inside the loop, the batch has the same target symbol by definition
282	seems a bit wasteful we have to materialize a temporary vector to copy from just because our first one is in the wrong order! Can we have it be in the right order instead?
285	avoid same name for type & variable
303	so IIUC given we're reducing 3 symbols, head is Z, Z.parent is Y: if Y has two parents X1, X2, then we may have multiple reduce paths/ReduceActions [Z Y X1] [Z Y X2], but if Y.parent is X and X has two parents W1 W2, then we have one reduce path [Z Y X] and two BaseInfos for W1 and W2. Why? Why not rather say that there's just one ReduceAction concept and the Base is part of its identity, and it gets built by enumerateReducePath?
324	partition seems like a really important concept here, but isn't defined.
338	I'm not really sure what this chunk of commented-out code is trying to say
351	Again, you're talking about ambiguity, but haven't ever really defined it (apart from how we're going to represent parse ambiguity in the forest). Can we use more precise language? It's hard to understand what the purpose of this block is.
376	nit: comment just echoes the code
381	sure, but why?
386	we're iterating over BuiltForestNodes in order to... build forest nodes. I guess BuiltForestNodes should rather be SequenceNodes? Why is it that we're grouping by GSS node here? My understanding was we wanted one forest node per (nonterminal, token range), but GSS nodes are finer grained than that (e.g. may be differentiated only by parse state, possibly parent list too?)
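
The comments on lines 211 and 214 can be combined into one comparator: early exits, with the higher-priority criteria first, and the symbol itself as the final tiebreak. The following is a minimal sketch under assumptions rather than a proposed replacement for the patch: `ReductionKey`, `ProcessLater` and `processingOrder` are hypothetical names, and a reduction is boiled down to its start position, the topological order of its target and the target's symbol ID.

```cpp
#include <queue>
#include <vector>

// Hypothetical summary of a reduction: where it starts, and what it produces.
struct ReductionKey {
  unsigned Begin;     // start token; the end is the current token for all
  unsigned TopoOrder; // topological order of the target symbol
  unsigned Symbol;    // target symbol ID, used only to break ties
};

// Returns true if L must be processed after R.
struct ProcessLater {
  bool operator()(const ReductionKey &L, const ReductionKey &R) const {
    // A later start means fewer tokens are spanned, so it goes first.
    if (L.Begin != R.Begin)
      return L.Begin < R.Begin;
    // For the same range, a lower topological order goes first.
    if (L.TopoOrder != R.TopoOrder)
      return L.TopoOrder > R.TopoOrder;
    // Incomparable symbols may share a topological order; keep them grouped.
    return L.Symbol > R.Symbol;
  }
};

// Returns the target symbols in the order the reductions would be processed.
std::vector<unsigned> processingOrder(const std::vector<ReductionKey> &Keys) {
  std::priority_queue<ReductionKey, std::vector<ReductionKey>, ProcessLater>
      Queue;
  for (const ReductionKey &K : Keys)
    Queue.push(K);
  std::vector<unsigned> Order;
  while (!Queue.empty()) {
    Order.push_back(Queue.top().Symbol);
    Queue.pop();
  }
  return Order;
}
```

Without the last comparison, two symbols with the same topological order could come out interleaved (ABAB); with it, reductions to the same symbol over the same range are always adjacent.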

sammccall added inline comments. Mar 23 2022, 4:43 PM

clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp
285	how can we be sure we're not creating duplicate forest nodes? IIUC, it's possible to have the same forest nodes on top of several heads of the stack. Distinct GSS nodes due to different states, but same forest nodes. Then the states may have overlapping itemsets, and both allow the same reduction. Here we unconditionally produce a forest sequence node for each ReduceAction, and we will have two ReduceActions with the same forestnodes on the stack.

Some notes before our meeting.

It does appear that it's possible to generate duplicate forest nodes in this way, and AFAIK any method other than explicitly deduplicating creation using a map<(rule, rhsnodes), sequencenode> is going to have this problem.
The good news is:

the cache lifetime is local to the family (outer batch): collisions necessarily happen within a family. (We can make it a member and clear it, to reuse storage)
I think we can form the forest nodes at the bottom of the enumerateReducePaths() call, and use them + base GSS node in place of the reduce path. (we never actually used internal gss path nodes, just the forest nodes + GSS base).
Now reductions found by enumerateReducePaths() are identified by (base, sequence node, new state), which are cheap to compare/group in various ways

I feel like this should simplify subsequent steps, but need to think more
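
The map<(rule, rhsnodes), sequencenode> idea from these notes could look roughly like the sketch below. It is an illustration under assumptions and not the parser's code: `ForestNode` is a simplified stand-in for the real class, nodes live in a `std::deque` instead of an arena, and `SequenceNodeCache`, `getOrCreate` and `clearFamily` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Simplified stand-in for the real forest node.
struct ForestNode {
  uint16_t Rule;
  std::vector<const ForestNode *> Children;
};

class SequenceNodeCache {
public:
  // Returns the existing node for (Rule, Children), or creates a new one.
  const ForestNode &getOrCreate(uint16_t Rule,
                                const std::vector<const ForestNode *> &Children) {
    Key K(Rule, Children);
    auto It = Cache.find(K);
    if (It != Cache.end())
      return *It->second;
    Storage.push_back(ForestNode{Rule, Children});
    Cache.emplace(std::move(K), &Storage.back());
    return Storage.back();
  }

  // Collisions only happen within one family, so the lookup table can be
  // reset between families while the nodes themselves stay alive.
  void clearFamily() { Cache.clear(); }

  std::size_t created() const { return Storage.size(); }

private:
  using Key = std::pair<uint16_t, std::vector<const ForestNode *>>;
  struct KeyLess {
    bool operator()(const Key &L, const Key &R) const {
      if (L.first != R.first)
        return L.first < R.first;
      // std::less gives a total order even for unrelated pointers.
      return std::lexicographical_compare(
          L.second.begin(), L.second.end(), R.second.begin(), R.second.end(),
          std::less<const ForestNode *>());
    }
  };

  std::deque<ForestNode> Storage; // push_back keeps element addresses stable
  std::map<Key, const ForestNode *, KeyLess> Cache;
};
```

Two reduce paths that carry the same forest nodes and use the same rule then map to the same sequence node, regardless of which GSS heads they were found on.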

hokein mentioned this in rGf383b88d8214: [pseudo] Sort nonterminals based on their reduction order. Mar 24 2022, 6:31 AM

hokein mentioned this in D121150: [pseudo] Implement the GLR parsing algorithm. May 2 2022, 12:35 PM

Revision Contents

Path	Size
clang/include/clang/Tooling/Syntax/Pseudo/Forest.h	13 lines
clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h	6 lines
clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h	4 lines
clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp	270 lines
clang/lib/Tooling/Syntax/Pseudo/GrammarBNF.cpp	41 lines
clang/tools/clang-pseudo/ClangPseudo.cpp	2 lines

Diff 414327

clang/include/clang/Tooling/Syntax/Pseudo/Forest.h

Show First 20 Lines • Show All 75 Lines • ▼ Show 20 Lines	return llvm::makeArrayRef(
reinterpret_cast<const ForestNode *const *>(this + 1), Num);		reinterpret_cast<const ForestNode *const *>(this + 1), Num);
}		}
static uint16_t SequenceData(RuleID rule,		static uint16_t SequenceData(RuleID rule,
llvm::ArrayRef<const ForestNode *> elements) {		llvm::ArrayRef<const ForestNode *> elements) {
assert(rule < (1 << RuleBits));		assert(rule < (1 << RuleBits));
assert(elements.size() < (1 << (16 - RuleBits)));		assert(elements.size() < (1 << (16 - RuleBits)));
return rule | elements.size() << RuleBits;		return rule | elements.size() << RuleBits;
}		}
		static uint16_t
		AmbiguousData(llvm::ArrayRef<const ForestNode *> alternatives) {
		return alternatives.size();
		}

Token::Index StartLoc;		Token::Index StartLoc;
Kind K : 4;		Kind K : 4;
SymbolID Symbol : SymbolBits;		SymbolID Symbol : SymbolBits;
// Sequences - child count : 4 | RuleID : 12		// Sequences - child count : 4 | RuleID : 12
// Ambiguous - child count : 16		// Ambiguous - child count : 16
// Terminal - unused		// Terminal - unused
uint16_t Data_;		uint16_t Data_;
// A trailing array of Node* .		// A trailing array of Node* .
Show All 9 Lines	public:
size_t bytes() const { return Arena.getBytesAllocated() + sizeof(this); }		size_t bytes() const { return Arena.getBytesAllocated() + sizeof(this); }

ForestNode &createSequence(SymbolID SID, RuleID RID, Token::Index Start,		ForestNode &createSequence(SymbolID SID, RuleID RID, Token::Index Start,
llvm::ArrayRef<const ForestNode *> Elements) {		llvm::ArrayRef<const ForestNode *> Elements) {
return create(ForestNode::Sequence, SID, Start,		return create(ForestNode::Sequence, SID, Start,
ForestNode::SequenceData(RID, Elements), Elements);		ForestNode::SequenceData(RID, Elements), Elements);
}		}

		ForestNode &createAmbiguous(SymbolID symbol,
		llvm::ArrayRef<const ForestNode *> alternatives) {
		assert(!alternatives.empty());
		return create(ForestNode::Ambiguous, symbol,
		alternatives.front()->startLoc(),
		ForestNode::AmbiguousData(alternatives), alternatives);
		}

void init(const TokenStream &Code);		void init(const TokenStream &Code);
const ForestNode &terminal(Token::Index Index) const {		const ForestNode &terminal(Token::Index Index) const {
assert(Terminals && "Terminals are not intialized!");		assert(Terminals && "Terminals are not intialized!");
return Terminals[Index];		return Terminals[Index];
}		}

private:		private:
ForestNode &create(ForestNode::Kind K, SymbolID SID, Token::Index Start,		ForestNode &create(ForestNode::Kind K, SymbolID SID, Token::Index Start,
Show All 19 Lines

clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h

Show First 20 Lines • Show All 75 Lines • ▼ Show 20 Lines	struct Node {

bool operator==(const Node &L) const {		bool operator==(const Node &L) const {
return State == L.State && predecessors() == L.predecessors();		return State == L.State && predecessors() == L.predecessors();
}		}
// A trailing array of Node*.		// A trailing array of Node*.
};		};

// Creates a new node in the graph.		// Creates a new node in the graph.
const Node *addNode(LRTable::StateID State,		Node *addNode(LRTable::StateID State,
const ::clang::syntax::pseudo::ForestNode *Symbol,		const ::clang::syntax::pseudo::ForestNode *Symbol,
llvm::ArrayRef<const Node *> Predecessors) {		llvm::ArrayRef<const Node *> Predecessors) {
assert(Predecessors.size() < (1 << Node::PredecessorBits) &&		assert(Predecessors.size() < (1 << Node::PredecessorBits) &&
"Too many predecessors to fit in PredecessorBits!");		"Too many predecessors to fit in PredecessorBits!");
++NodeCount;		++NodeCount;
Node *Result = new (Arena.Allocate(		Node *Result = new (Arena.Allocate(
sizeof(Node) + Predecessors.size() * sizeof(Node *), alignof(Node)))		sizeof(Node) + Predecessors.size() * sizeof(Node *), alignof(Node)))
Node({State, static_cast<unsigned>(Predecessors.size())});		Node({State, static_cast<unsigned>(Predecessors.size())});
Result->Parsed = Symbol;		Result->Parsed = Symbol;
if (!Predecessors.empty())		if (!Predecessors.empty())
▲ Show 20 Lines • Show All 54 Lines • Show Last 20 Lines

clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h

	Show First 20 Lines • Show All 152 Lines • ▼ Show 20 Lines
	// Storage for the underlying data of the Grammar.			// Storage for the underlying data of the Grammar.
	// It can be constructed dynamically (from compiling BNF file) or statically			// It can be constructed dynamically (from compiling BNF file) or statically
	// (a compiled data-source).			// (a compiled data-source).
	struct GrammarTable {			struct GrammarTable {
	GrammarTable();			GrammarTable();

	struct Nonterminal {			struct Nonterminal {
	std::string Name;			std::string Name;
				// Value of topological order of the nonterminal.
				// For nonterminals A and B, if A := B (or transitively), then
				// A.TopologicalOrder > B.TopologicalOrder.
				unsigned TopologicalOrder = 0;
	// Corresponding rules that construct the non-terminal, it is a [start, end)			// Corresponding rules that construct the non-terminal, it is a [start, end)
	// index range of the Rules table.			// index range of the Rules table.
	struct {			struct {
	RuleID start;			RuleID start;
	RuleID end;			RuleID end;
	} RuleRange;			} RuleRange;
	};			};

	Show All 16 Lines

clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp

Show All 12 Lines
#include "clang/Tooling/Syntax/Pseudo/Token.h"		#include "clang/Tooling/Syntax/Pseudo/Token.h"
#include "llvm/ADT/ArrayRef.h"		#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/STLExtras.h"		#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/StringExtras.h"		#include "llvm/ADT/StringExtras.h"
#include "llvm/Support/Debug.h"		#include "llvm/Support/Debug.h"
#include "llvm/Support/ErrorHandling.h"		#include "llvm/Support/ErrorHandling.h"
#include "llvm/Support/FormatVariadic.h"		#include "llvm/Support/FormatVariadic.h"
#include <memory>		#include <memory>
		#include <queue>
#include <tuple>		#include <tuple>

#define DEBUG_TYPE "GLRParser.cpp"		#define DEBUG_TYPE "GLRParser.cpp"

namespace clang {		namespace clang {
namespace syntax {		namespace syntax {
namespace pseudo {		namespace pseudo {

▲ Show 20 Lines • Show All 122 Lines • ▼ Show 20 Lines	if (RemainingLength == 0) {
for (const auto *Next : Current->predecessors())		for (const auto *Next : Current->predecessors())
enumPath(Next, RemainingLength);		enumPath(Next, RemainingLength);
}		}
PathStorage.pop_back();		PathStorage.pop_back();
};		};
enumPath(Head, PathLength);		enumPath(Head, PathLength);
}		}

// Perform reduction recursively until we don't have reduce actions with		// Perform reductions recursively until there is no available reductions.
// heads.		//
void GLRParser::performReduction(const Token &Lookahead) {		// Reductions can manipulate the GSS in following way:
if (!ReduceList.empty())
LLVM_DEBUG(llvm::dbgs() << " Performing Reduce\n");

// Reduce can manipulate the GSS in following way:
//		//
// 1) Split --		// 1) Split --
// 1.1 when a stack head has mutiple reduce actions, the head is		// 1.1 when a stack head has mutiple reduce actions, the head is
// made to split to accommodate the various possiblities.		// made to split to accommodate the various possiblities.
// E.g.		// E.g.
// 0 -> 1 (ID)		// 0 -> 1 (ID)
// After performing reduce of production rules (class-name := ID,		// After performing reduce of production rules (class-name := ID,
// enum-name := ID), the GSS now has two new heads:		// enum-name := ID), the GSS now has two new heads:
// 0 -> 2 (class-name)		// 0 -> 2 (class-name)
// `-> 3 (enum-name)		// `-> 3 (enum-name)
//		//
// 1.2 when a stack head has a reduce action with multiple reduce		// 1.2 when a stack head has a reduce action with multiple bases, the head
// paths, the head is to split.		// will be split if each base leads to different states.
// E.g.		// E.g.
// ... -> 1(...) -> 3 (INT)		// ... -> 1(...) -> 3 (INT)
// ^		// ^
// ... -> 2(...) ---|		// ... -> 2(...) ---|
//		//
// After the reduce action (simple-type-specifier := INT), the GSS looks		// After the reduce action (simple-type-specifier := INT), the GSS looks
// like:		// like:
// ... -> 1(...) -> 4 (simple-type-specifier)		// ... -> 1(...) -> 4 (simple-type-specifier)
// ... -> 2(...) -> 5 (simple-type-specifier)		// ... -> 2(...) -> 5 (simple-type-specifier)
//		//
// 2) Merge -- if multiple heads turn out to be identical after		// 2) Merge -- if multiple heads turn out to be identical after
// reduction (new heads have the same state, and point to the same		// reduction (new heads have the same state, and point to the same
// predecessors), these heads are merged and treated as a single head.		// predecessors), these heads are merged and treated as a single head.
// This is usually where ambiguity happens.		// This is usually where ambiguity happens.
//		//
// E.g.		// E.g.
// 0 -> 2 (class-name)		// 0 -> 2 (class-name)
// ` -> 3 (enum-name)		// ` -> 3 (enum-name)
// After reduction of rules (type-name := class-name | enum-name), the GSS		// After reduction of rules (type-name := class-name | enum-name), the GSS
// has the following form:		// has the following form:
// 0 -> 4 (type-name)		// 0 -> 4 (type-name)
// The type-name forest node in the new head 4 is ambiguous, which has two		// The type-name forest node in the new head 4 is ambiguous, which has two
// parses (type-name -> class-name -> id, type-name -> enum-name -> id).		// parses (type-name -> class-name -> id, type-name -> enum-name -> id).
		void GLRParser::performReduction(const Token &Lookahead) {
		if (!ReduceList.empty())
		LLVM_DEBUG(llvm::dbgs() << " Performing Reduce\n");

// Store all newly-created stack heads for tracking ambiguities.		// Reductions per reduce path.
std::vector<const Graph::Node *> CreatedHeads;		struct ReduceAction {
while (!ReduceList.empty()) {		std::vector<const Graph::Node *> ReducePath;
auto RA = std::move(ReduceList.back());		RuleID ReduceRuleID;
ReduceList.pop_back();		};
		auto OrderCmp = [this](const ReduceAction &L, const ReduceAction &R) {
		auto LBegin = L.ReducePath.back()->Parsed->startLoc();
		auto RBegin = R.ReducePath.back()->Parsed->startLoc();
		if (LBegin == RBegin)
		return G.table()
		.Nonterminals[G.lookupRule(L.ReduceRuleID).Target]
		.TopologicalOrder >
		G.table()
		.Nonterminals[G.lookupRule(R.ReduceRuleID).Target]
		.TopologicalOrder;
		return LBegin < RBegin;
		};
		auto Equal = [this](const ReduceAction &L, const ReduceAction &R) {
		return L.ReducePath.back()->Parsed->startLoc() ==
		R.ReducePath.back()->Parsed->startLoc() &&
		G.lookupRule(L.ReduceRuleID).Target ==
		G.lookupRule(R.ReduceRuleID).Target;
		};

		// Forest node is unmutable. To create an amgbiguous forest node, we need to
		// know all alternatives in advance.
		//
		// All Reductions must be performed in a careful order, so that we can gather
		// all ambiguous alternatives as a batch, and process them as a single pass.
		//
		// Reductions is stored in a priority queue with a sorted order according to:
		// Rule 1: Reductions which span fewer tokens are processed first;
		// Rule 2: If two reductions A := X and B := Y span the same tokens,
		// A := X is processed first if topological order of nonterminal
		// A is less than nonterminal B (That is to say: if there is a
		// production rule B := A in the grammar, the reduction A := X
		// should come first because it will enable a new reduction B := A);
		//
		// Each iteration, we construct a batch from the priority queue. Reductions in
		// the batch span the same tokens and reduce to the same nonterminal.
		//
		// Local ambiguity packing (if present) is guaranteed in each batch.
		std::priority_queue<ReduceAction, std::vector<ReduceAction>,
		decltype(OrderCmp)>
		OrderedReduceList(OrderCmp);
		auto addToOrderedReduceList = [&OrderedReduceList,
		this](decltype(ReduceList) &ReduceList) {
		std::vector<const Graph::Node *> ReducePath;
		for (const auto &RA : ReduceList) {
RuleID ReduceRuleID = RA.PerformAction->getReduceRule();		RuleID ReduceRuleID = RA.PerformAction->getReduceRule();
const Rule &ReduceRule = G.lookupRule(ReduceRuleID);		const Rule &ReduceRule = G.lookupRule(ReduceRuleID);
LLVM_DEBUG(llvm::dbgs() << llvm::formatv(
" !reduce rule {0}: {1} head: {2}\n", ReduceRuleID,
G.dumpRule(ReduceRuleID), RA.Head->State));

std::vector<const Graph::Node *> ReducePath;
enumerateReducePath(RA.Head, ReduceRule.Size, ReducePath, [&]() {		enumerateReducePath(RA.Head, ReduceRule.Size, ReducePath, [&]() {
LLVM_DEBUG(		OrderedReduceList.push({ReducePath, ReduceRuleID});
llvm::dbgs() << llvm::formatv(		});
" stack path: {0}, bases: {1}\n",		}
llvm::join(getStateString(ReducePath), " -> "),		ReduceList.clear();
llvm::join(getStateString(ReducePath.back()->predecessors()),		};
", ")));		addToOrderedReduceList(ReduceList);
assert(ReducePath.size() == ReduceRule.Size &&
"Reduce path's length must equal to the reduce rule size");		while (!OrderedReduceList.empty()) {
		std::vector<ReduceAction> Batch;
		do {
		Batch.push_back(OrderedReduceList.top());
		OrderedReduceList.pop();
		} while (!OrderedReduceList.empty() &&
		Equal(OrderedReduceList.top(), Batch.front()));

		sammccall: the overall function is too long, this next chunk "process a batch of reductions that produce the same symbol over the same token range" seems like a reasonable place to pull out a function. We should introduce a name for this concept: maybe a reduction family?
		// newly-created GSS node -> corresponding forest node.
		// If there is more than 1 forest node, it means we hit ambiguities.
		// Used to assemble the ambiguous forest node at the end.
		llvm::DenseMap<Graph::Node *, std::vector<const ForestNode *>>
		sammccall: I don't really like the complexity of the transient data structures we're building here: many hashtables and vectors for each reduction.
		BuiltForestNodes;
		// Track whether we hit ambiguities, determined by the equality of
		sammccall: I'm not really sure precisely what "ambiguities", "equality" and "predecessors" mean in this comment. Can we be more specific?
		// predecessors.
		std::vector<Graph::Node *> CreatedGSSNodes;

		for (const auto &RA : Batch) {
		SymbolID ReduceSymbolID = G.lookupRule(RA.ReduceRuleID).Target;
		sammccall: this doesn't make sense inside the loop, the batch has the same target symbol by definition.
		// Create a corresponding sequence forest node for the reduce rule.
		std::vector<const ForestNode *> ForestChildren;
		sammccall: seems a bit wasteful we have to materialize a temporary vector to copy from just because our first one is in the wrong order! Can we have it be in the right order instead?
		for (const Graph::Node *PN : llvm::reverse(RA.ReducePath))
		ForestChildren.push_back(PN->Parsed);
		const ForestNode &ForestNode = ParsedForest.createSequence(
		sammccall: avoid same name for type & variable.
		sammccall: how can we be sure we're not creating duplicate forest nodes? IIUC, it's possible to have the same forest nodes on top of several heads of the stack. Distinct GSS nodes due to different states, but same forest nodes. Then the states may have overlapping itemsets, and both allow the same reduction. Here we unconditionally produce a forest sequence node for each ReduceAction, and we will have two ReduceActions with the same forest nodes on the stack.
		ReduceSymbolID, RA.ReduceRuleID, ForestChildren.front()->startLoc(),
		ForestChildren);

// A reduce is a back-and-forth operation in the stack.		// A reduce is a back-and-forth operation in the stack.
// For example, we reduce a rule "declaration := decl-specifier-seq ;" on		// For example, we reduce a rule "declaration := decl-specifier-seq ;" on
// the linear stack:		// the linear stack:
//		//
// 0 -> 1(decl-specifier-seq) -> 3(;)		// 0 -> 1(decl-specifier-seq) -> 3(;)
// ^ Base ^ Head		// ^ Base ^ Head
// <--- ReducePath: [3,1] ---->		// <--- ReducePath: [3,1] ---->
//		//
// 1. back -- pop |ReduceRuleLength| nodes (ReducePath) in the stack;		// 1. back -- pop |ReduceRuleLength| nodes (ReducePath) in the stack;
// 2. forth -- push a new node in the stack and mark it as a head;		// 2. forth -- push a new node in the stack and mark it as a head;
		//
// 0 -> 4(declaration)		// 0 -> 4(declaration)
// ^ Head		// ^ Head
//		//
// It becomes tricky if a reduce path has multiple bases, we want to merge		// Each RA corresponds to a single reduce path, but a reduce path can have
		sammccall: so IIUC given we're reducing 3 symbols, head is Z, Z.parent is Y: if Y has two parents X1, X2, then we may have multiple reduce paths/ReduceActions [Z Y X1] [Z Y X2]; but if Y.parent is X and X has two parents W1, W2, then we have one reduce path [Z Y X] and two BaseInfos for W1 and W2. Why? Why not rather say that there's just one ReduceAction concept and the Base is part of its identity, and it gets built by enumerateReducePath?
// them if their next state is the same. Similar to above performShift,		// multiple Bases, which could split the stack (depends on whether their
// we partition the bases by their next state, and process each partition		// next state is identical).
// per loop iteration.		// Similar to above `performShift`, we partition the Bases by their
		// next state, and process each partition.
struct BaseInfo {		struct BaseInfo {
// An intermediate head after the stack has popped |ReducePath| nodes.		// An intermediate head after the stack has popped |ReducePath| nodes.
const Graph::Node *Base = nullptr;		const Graph::Node *Base = nullptr;
// The final state after reduce.		// The final state after reduction.
// It is getGoToState(Base->State, ReduceSymbol).		// The value is getGoToState(Base->State, ReduceSymbol).
StateID NextState;		StateID NextState;
};		};
std::vector<BaseInfo> Bases;		std::vector<BaseInfo> Bases;
for (const Graph::Node *Base : ReducePath.back()->predecessors())		for (const Graph::Node *Base : RA.ReducePath.back()->predecessors())
Bases.push_back(		Bases.push_back(
{Base, ParsingTable.getGoToState(Base->State, ReduceRule.Target)});		{Base, ParsingTable.getGoToState(Base->State, ReduceSymbolID)});
		alextsao1999: Maybe we can make goto more clear? like `performGoto` after every GLR reduction.
llvm::sort(Bases, [](const BaseInfo &L, const BaseInfo &R) {		llvm::sort(Bases, [](const BaseInfo &L, const BaseInfo &R) {
return std::forward_as_tuple(L.NextState, L.Base) <		return std::forward_as_tuple(L.NextState, L.Base) <
std::forward_as_tuple(R.NextState, R.Base);		std::forward_as_tuple(R.NextState, R.Base);
});		});

llvm::ArrayRef<BaseInfo> Partition = llvm::makeArrayRef(Bases);		llvm::ArrayRef<BaseInfo> Partition = llvm::makeArrayRef(Bases);
		sammccall: partition seems like a really important concept here, but isn't defined.
while (!Partition.empty()) {		while (!Partition.empty()) {
StateID NextState = Partition.front().NextState;		StateID NextState = Partition.front().NextState;
// Predecessors of the new stack head.		// Predecessors of the new stack head.
std::vector<const Graph::Node *> Predecessors;		std::vector<const Graph::Node *> Predecessors;
auto Batch = Partition.take_while([&](const BaseInfo &TB) {		auto Batch = Partition.take_while([&](const BaseInfo &TB) {
if (NextState != TB.NextState)		if (NextState != TB.NextState)
return false;		return false;
Predecessors.push_back(TB.Base);		Predecessors.push_back(TB.Base);
return true;		return true;
});		});
assert(!Batch.empty());		assert(!Batch.empty());
Partition = Partition.drop_front(Batch.size());		Partition = Partition.drop_front(Batch.size());

		// Not needed, as it is created outside of the partition-loop.
		sammccall: I'm not really sure what this chunk of commented-out code is trying to say.
		// Create a corresponding sequence forest node for the reduce rule.
		// std::vector<const ForestNode *> ForestChildren;
		// for (const Graph::Node *PN : llvm::reverse(RA.ReducePath))
		// ForestChildren.push_back(PN->Parsed);
		// const ForestNode &ForestNode = ParsedForest.createSequence(
		// ReduceSymbolID, RA.ReduceRuleID,
		// ForestChildren.front()->startLoc(), ForestChildren);
		LLVM_DEBUG(llvm::dbgs() << llvm::formatv(
		" after reduce: {0} -> state {1} ({2})\n",
		llvm::join(getStateString(Predecessors), ", "),
		NextState, G.symbolName(ReduceSymbolID)));

// Check ambiguities.		// Check ambiguities.
		sammccall: Again, you're talking about ambiguity, but haven't ever really defined it (apart from how we're going to represent parse ambiguity in the forest). Can we use more precise language? It's hard to understand what the purpose of this block is.
auto It = llvm::find_if(CreatedHeads, [&](const Graph::Node *Head) {		// FIXME: this is a linear scan, it might be too slow.
return Head->Parsed->symbol() == ReduceRule.Target &&		auto It =
Head->predecessors() == llvm::makeArrayRef(Predecessors);		llvm::find_if(CreatedGSSNodes, [&](const Graph::Node *Created) {
		// Guaranteed by the side-effect of partition.
		assert(llvm::is_sorted(Created->predecessors()) &&
		llvm::is_sorted(llvm::makeArrayRef(Predecessors)));
		// Guaranteed by the Batch, where all reductions reduce to
		// the same nonterminal.
		assert(Created->Parsed->symbol() == ReduceSymbolID);
		return Created->predecessors() ==
		llvm::makeArrayRef(Predecessors);
});		});
if (It != CreatedHeads.end()) {		if (It != CreatedGSSNodes.end()) {
// This should be guaranteed by checking the equivalence of		// This is guaranteed by the equality of predecessors and target
// predecessors and reduce nonterminal symbol!		// nonterminal of reduction rule!
assert(NextState == (*It)->State);		assert(NextState == (*It)->State);
LLVM_DEBUG(llvm::dbgs() << llvm::formatv(		LLVM_DEBUG(llvm::dbgs() << llvm::formatv(
" found ambiguity, merged in state {0} (forest "		" found ambiguity, merged in state {0} (forest "
"'{1}')\n",		"'{1}')\n",
(*It)->State, G.symbolName((*It)->Parsed->symbol())));		NextState, G.symbolName((*It)->Parsed->symbol())));
// FIXME: create ambiguous forest node!		BuiltForestNodes[*It].push_back(&ForestNode);
continue;		continue;
}		}

// Create a corresponding sequence forest node for the reduce rule.		// Create a new GSS node.
		sammccall: nit: comment just echoes the code.
std::vector<const ForestNode *> ForestChildren;		Graph::Node *Head = GSS.addNode(NextState, &ForestNode, Predecessors);
for (const Graph::Node *PN : llvm::reverse(ReducePath))		CreatedGSSNodes.push_back(Head);
ForestChildren.push_back(PN->Parsed);		BuiltForestNodes[Head].push_back(&ForestNode);
const ForestNode &ForestNode = ParsedForest.createSequence(
ReduceRule.Target, RA.PerformAction->getReduceRule(),
ForestChildren.front()->startLoc(), ForestChildren);
LLVM_DEBUG(llvm::dbgs() << llvm::formatv(
" after reduce: {0} -> state {1} ({2})\n",
llvm::join(getStateString(Predecessors), ", "),
NextState, G.symbolName(ReduceRule.Target)));

// Create a new stack head.
const Graph::Node *Head =
GSS.addNode(NextState, &ForestNode, Predecessors);
CreatedHeads.push_back(Head);

// Actions that are enabled by this reduce.		// Actions that are enabled by this reduction.
		sammccall: sure, but why?
addActions(Head, Lookahead);		addActions(Head, Lookahead);
}		}
});		}
		// We're good to assemble the ambiguous forest node, if any.
		for (auto It : BuiltForestNodes) {
		sammccall: we're iterating over BuiltForestNodes in order to... build forest nodes. I guess BuiltForestNodes should rather be SequenceNodes? Why is it that we're grouping by GSS node here? My understanding was we wanted one forest node per (nonterminal, token range), but GSS nodes are finer grained than that (e.g. may be differentiated only by parse state, possibly parent list too?)
		if (It.second.size() > 1) {
		It.first->Parsed = &ParsedForest.createAmbiguous(
		It.second.front()->symbol(), It.getSecond());
		continue;
		}
		assert(It.first->Parsed == It.getSecond().front());
		}
		// OK, now add newly-enabled reductions to the ordered list;
		addToOrderedReduceList(ReduceList);
}		}
}		}
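
The batching loop in the new version of this function can be hard to follow inside the diff. Below is a minimal standalone sketch of the idea, not the patch's code: `PendingReduction`, `LowestOrderFirst` and `popBatch` are hypothetical names, and the sketch assumes that the queue is ordered by the topological order of each rule's target nonterminal and that two reductions belong to the same batch when that order matches. The real comparator and `Equal` predicate are defined outside the lines shown here and may also consider the token range.

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Hypothetical stand-in for the patch's ReduceAction; only the fields that
// matter for batching are modelled.
struct PendingReduction {
  unsigned TargetOrder; // assumed: topological order of the rule's target
  unsigned RuleID;
};

// Makes std::priority_queue behave as a min-heap on TargetOrder, so that
// reductions whose target has the lowest order are popped first.
struct LowestOrderFirst {
  bool operator()(const PendingReduction &L, const PendingReduction &R) const {
    return L.TargetOrder > R.TargetOrder;
  }
};

using ReduceQueue = std::priority_queue<PendingReduction,
                                        std::vector<PendingReduction>,
                                        LowestOrderFirst>;

// Pops the top reduction plus every following reduction with the same
// TargetOrder. The caller must ensure the queue is not empty.
std::vector<PendingReduction> popBatch(ReduceQueue &Q) {
  assert(!Q.empty() && "no pending reductions");
  std::vector<PendingReduction> Batch;
  do {
    Batch.push_back(Q.top());
    Q.pop();
  } while (!Q.empty() && Q.top().TargetOrder == Batch.front().TargetOrder);
  return Batch;
}
```

With the grammar from the summary, `expr := ID` would be popped alone, and `stmt := ID` would wait in the queue until `stmt := expr` has been added, so both `stmt` reductions end up in one batch.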

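Since "partition" is not defined in the code comments, here is a small standalone sketch of what the sort plus `take_while` loop appears to compute: the Bases are sorted by (NextState, Base), and each run of equal NextState becomes the predecessor list of one new GSS node. Plain `int` values stand in for `const Graph::Node *` and `StateID`, and `groupByNextState` is a hypothetical helper, not part of the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <utility>
#include <vector>

// Simplified BaseInfo: ints replace the node pointer and the state ID.
struct BaseInfo {
  int Base;
  int NextState;
};

// Returns one predecessor list per distinct NextState, in ascending
// NextState order; each list is sorted by Base as a side effect of the sort.
std::vector<std::vector<int>> groupByNextState(std::vector<BaseInfo> Bases) {
  std::sort(Bases.begin(), Bases.end(),
            [](const BaseInfo &L, const BaseInfo &R) {
              return std::tie(L.NextState, L.Base) <
                     std::tie(R.NextState, R.Base);
            });
  std::vector<std::vector<int>> Groups;
  for (std::size_t I = 0; I < Bases.size();) {
    int NextState = Bases[I].NextState;
    std::vector<int> Predecessors;
    // Consume the run of entries that share this NextState.
    for (; I < Bases.size() && Bases[I].NextState == NextState; ++I)
      Predecessors.push_back(Bases[I].Base);
    assert(!Predecessors.empty());
    Groups.push_back(std::move(Predecessors));
  }
  return Groups;
}
```

Because every group comes out sorted, two groups can be compared element by element, which is what the `is_sorted` assertions in the ambiguity check rely on.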
void GLRParser::addActions(const Graph::Node *Head, const Token &Lookahead) {		void GLRParser::addActions(const Graph::Node *Head, const Token &Lookahead) {
for (const auto &Action :		for (const auto &Action :
ParsingTable.getActions(Head->State, tokenSymbol(Lookahead.Kind))) {		ParsingTable.getActions(Head->State, tokenSymbol(Lookahead.Kind))) {
switch (Action.kind()) {		switch (Action.kind()) {
case LRTable::Action::Shift:		case LRTable::Action::Shift:
(17 lines not shown)

clang/lib/Tooling/Syntax/Pseudo/GrammarBNF.cpp

(94 lines not shown)	std::unique_ptr<Grammar> build(llvm::StringRef BNF) {
});		});
RuleID RulePos = 0;		RuleID RulePos = 0;
for (SymbolID SID = 0; SID < T->Nonterminals.size(); ++SID) {		for (SymbolID SID = 0; SID < T->Nonterminals.size(); ++SID) {
RuleID Start = RulePos;		RuleID Start = RulePos;
while (RulePos < T->Rules.size() && T->Rules[RulePos].Target == SID)		while (RulePos < T->Rules.size() && T->Rules[RulePos].Target == SID)
++RulePos;		++RulePos;
T->Nonterminals[SID].RuleRange = {Start, RulePos};		T->Nonterminals[SID].RuleRange = {Start, RulePos};
}		}
		calculateDependencyOrder(T.get());
auto G = std::make_unique<Grammar>(std::move(T));		auto G = std::make_unique<Grammar>(std::move(T));
diagnoseGrammar(*G);		diagnoseGrammar(*G);
return G;		return G;
}		}

		void calculateDependencyOrder(GrammarTable *T) const {
		llvm::DenseMap<SymbolID, llvm::DenseSet<SymbolID>> DependencyGraph;
		for (const auto &Rule : T->Rules) {
		// A := B, A depends on B.
		if (Rule.Size == 1 && pseudo::isNonterminal(Rule.Sequence[0]))
		DependencyGraph[Rule.Target].insert(Rule.Sequence[0]);
		}
		std::vector<SymbolID> Order;
		// Each nonterminal state flows: NotVisited -> Visiting -> Visited.
		enum State {
		NotVisited,
		Visiting,
		Visited,
		};
		std::vector<SymbolID> VisitStates(T->Nonterminals.size(), NotVisited);
		std::function<void(SymbolID)> DFS = [&](SymbolID SID) -> void {
		if (VisitStates[SID] == Visited)
		return;
		if (VisitStates[SID] == Visiting) {
		Diagnostics.push_back(
		llvm::formatv("The grammar is cyclic, see symbol {0}\n",
		T->Nonterminals[SID].Name));
		return;
		}
		VisitStates[SID] = Visiting;
		auto It = DependencyGraph.find(SID);
		if (It != DependencyGraph.end()) {
		for (SymbolID Dep : (It->getSecond()))
		DFS(Dep);
		}
		VisitStates[SID] = Visited;
		Order.push_back(SID);
		};
		for (SymbolID ID = 0; ID != T->Nonterminals.size(); ++ID)
		DFS(ID);
		for (size_t I = 0; I < Order.size(); ++I) {
		T->Nonterminals[Order[I]].TopologicalOrder = I;
		}
		}
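
For reference, the depth-first ordering used by `calculateDependencyOrder` can be sketched outside the grammar classes as follows. This is an illustrative standalone version under simplified assumptions: nodes are plain indices, `Deps` and `dependencyOrder` are hypothetical names, and a cycle is reported through a flag rather than a diagnostic string.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Deps[N] lists the nodes that N depends on. Returns, for each node, its
// position in a dependencies-first order. Sets Cyclic when a node is reached
// again while it is still being visited.
std::vector<std::size_t>
dependencyOrder(const std::vector<std::vector<std::size_t>> &Deps,
                bool &Cyclic) {
  enum State { NotVisited, Visiting, Visited };
  std::vector<State> States(Deps.size(), NotVisited);
  std::vector<std::size_t> Order;
  Cyclic = false;
  std::function<void(std::size_t)> DFS = [&](std::size_t N) {
    if (States[N] == Visited)
      return;
    if (States[N] == Visiting) { // back edge: the graph has a cycle
      Cyclic = true;
      return;
    }
    States[N] = Visiting;
    for (std::size_t D : Deps[N])
      DFS(D);
    States[N] = Visited;
    Order.push_back(N); // post-order: dependencies are already in Order
  };
  for (std::size_t N = 0; N != Deps.size(); ++N)
    DFS(N);
  assert(Order.size() == Deps.size());
  std::vector<std::size_t> Position(Deps.size());
  for (std::size_t I = 0; I < Order.size(); ++I)
    Position[Order[I]] = I;
  return Position;
}
```

For the grammar in the summary (TU depends on stmt and expr, stmt depends on expr), this places expr before stmt and stmt before TU, which is consistent with reducing `expr := ID` before the `stmt` batch.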

private:		private:
// Text representation of a BNF grammar rule.		// Text representation of a BNF grammar rule.
struct RuleSpec {		struct RuleSpec {
llvm::StringRef Target;		llvm::StringRef Target;
struct Element {		struct Element {
llvm::StringRef Symbol; // Name of the symbol		llvm::StringRef Symbol; // Name of the symbol
};		};
std::vector<Element> Sequence;		std::vector<Element> Sequence;
(133 lines not shown)

clang/tools/clang-pseudo/ClangPseudo.cpp

(83 lines not shown)	if (ParseFile.getNumOccurrences()) {
clang::syntax::pseudo::GLRParser Parser(T, *G, Arena);		clang::syntax::pseudo::GLRParser Parser(T, *G, Arena);
const auto *Root = Parser.parse(Tokens);		const auto *Root = Parser.parse(Tokens);
if (Root) {		if (Root) {
llvm::outs() << "parsed successfully!\n";		llvm::outs() << "parsed successfully!\n";
llvm::outs() << "Forest bytes: " << Arena.bytes()		llvm::outs() << "Forest bytes: " << Arena.bytes()
<< " nodes: " << Arena.nodeNum() << "\n";		<< " nodes: " << Arena.nodeNum() << "\n";
llvm::outs() << "GSS bytes: " << Parser.getGSS().bytes()		llvm::outs() << "GSS bytes: " << Parser.getGSS().bytes()
<< " nodes: " << Parser.getGSS().nodeCount() << "\n";		<< " nodes: " << Parser.getGSS().nodeCount() << "\n";
// llvm::outs() << Root->DumpRecursive(*G, true);		llvm::outs() << Root->DumpRecursive(*G, false);
}		}
}		}
return 0;		return 0;
}		}

if (Source.getNumOccurrences()) {		if (Source.getNumOccurrences()) {
std::string Text = readOrDie(Source);		std::string Text = readOrDie(Source);
clang::LangOptions LangOpts; // FIXME: use real options.		clang::LangOptions LangOpts; // FIXME: use real options.
(13 lines not shown)