
[pseudo] Implement LRGraph

Authored by hokein on Feb 7 2022, 11:36 AM.



LRGraph is the key component of the clang pseudo parser: a
deterministic handle-finding finite-state machine, which is used to
generate the LR parsing table.

Separate from

Diff Detail

Event Timeline

hokein created this revision.Feb 7 2022, 11:36 AM
hokein requested review of this revision.Feb 7 2022, 11:36 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 7 2022, 11:36 AM
sammccall accepted this revision.Feb 8 2022, 2:01 AM

LG, thank you!
Bunch of nits, up to you what you want to do with them.


I think it'd be useful to be a little more concrete here than "find"...

and collect the right-hand side of a production rule (called handle) on top
of the stack, then replace (reduce) the handle with the nonterminal defined
by the production rule.
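The handle-replacing step described above can be made concrete with a small toy (this is an illustration only, not the pseudoparser's API; Rule, tryReduce and the string symbols are invented for the sketch): when the symbols on top of the stack match a rule's right-hand side, pop them and push the rule's nonterminal.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Illustrative rule: Target := Body, e.g. A := x y.
struct Rule {
  std::string Target;            // nonterminal on the left-hand side
  std::vector<std::string> Body; // right-hand side (the "handle")
};

// If the top of the stack matches R's right-hand side, replace it
// (reduce) with R's nonterminal and return true.
bool tryReduce(std::vector<std::string> &Stack, const Rule &R) {
  if (Stack.size() < R.Body.size())
    return false;
  if (!std::equal(R.Body.begin(), R.Body.end(),
                  Stack.end() - R.Body.size()))
    return false;
  Stack.resize(Stack.size() - R.Body.size()); // pop the handle
  Stack.push_back(R.Target);                  // push the nonterminal
  return true;
}
```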

The lack of spaces on the RHS is inconsistent with the debug output, and any examples with real grammars (since we need spaces to split up words).

Can we use A := x y instead?
Or possibly A := X Y which makes the boundaries more obvious when it appears in-line with text.

(I don't have an opinion about A := X . Y vs A := X.Y (dot *replaces* space), but again we should pick one consistent with the debug output.)


I think the calls to these constructors could be a little clearer as factories that describe their purpose:


static Item start(RuleID ID, const Grammar &G);
static Item sentinel(uint8_t ID);
Item Item::advance() const;

I think that would cover everything?


this alias is used only once in the header, for Items - it doesn't really introduce a new public concept that's distinct from that field.

I think it would be clearer to write it as std::vector<Item> there, and move the alias to the implementation file.


call this "SortByNextSymbol"? or something?

As it is, it's hard at a glance to understand why we have both operator< and LessItem.


This comparator is pure implementation detail, and can go in the cpp file.
(Still fine to reference in the comment, but I think "in a canonical order" is enough)


this seems to compare items equal if both are dot-at-the-end.

In general the cascade/priority/completeness of comparisons isn't really easy to spot. Consider structuring like:

if (L.hasNext() && R.hasNext() && L.next(G) != R.next(G))
  return L.next(G) < R.next(G);
if (L.hasNext() != R.hasNext())
  return L.hasNext() < R.hasNext(); // trailing dot is minimum
return L < R;
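A runnable version of that comparison structure might look like the sketch below. Item, hasNext and next are stand-ins for the patch's real types (and the grammar argument is dropped for self-containment); the point is the explicit cascade: next symbol first, then dot-at-the-end as the minimum, then a tie-break.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for an LR(0) item: a rule's right-hand side plus a dot position.
struct Item {
  std::vector<int> RHS;  // symbols on the rule's right-hand side
  uint8_t DotPos = 0;

  bool hasNext() const { return DotPos < RHS.size(); }
  int next() const { return RHS[DotPos]; } // symbol right after the dot
  // Arbitrary tie-break so the order is total.
  bool operator<(const Item &R) const { return DotPos < R.DotPos; }
};

// Order items by the symbol after the dot; dot-at-the-end sorts first.
bool lessByNextSymbol(const Item &L, const Item &R) {
  if (L.hasNext() && R.hasNext() && L.next() != R.next())
    return L.next() < R.next();
  if (L.hasNext() != R.hasNext())
    return L.hasNext() < R.hasNext(); // trailing dot is minimum
  return L < R;
}
```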

IIUC the data structure used here is inherently LR0, rather than the choice of how we build it.
(Vs in the LRTable where we could come up with data representing a full LR(1) parser without changing the structure).

So I think this function doesn't need LR0 in its name (but the class could be LR0Graph if you like)?


This took me a while to understand: you're reusing the ItemSet (passed by value) as the work queue (well, stack) and it's already initialized to the right elements.

This is clever, but please add a comment. Maybe even rename IS->Queue


nit: auto -> Item?
(Generally if the types aren't hard to write or read, it seems nice to include them)


nit: I find the spacing (blank lines) here a bit confusing: this seems part of the previous "paragraph" rather than the following one.

Could use a comment like "Is there a nonterminal next() to expand?"


You still have the IS array sitting there, you could reuse that again and avoid the allocation if you didn't expand the set too much :-)


Maybe call this Item.advance(), to avoid exposing the constructor


Since this is a single-line output, it seems more flexible to put the caller in charge of adding a newline if they want one?

(This would be inconsistent with State::dump of course, but I think that's OK. Maybe we should work out a naming convention at some point)


Hardcoding the indentation seems a bit special-purpose. Make the indent level an optional parameter to State::dump() and use OS.indent()?
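A standard-library analogue of that suggestion (this sketch uses std::ostream and a string fill in place of llvm::raw_ostream::indent, and the State shape is invented): the caller passes the indent level instead of the method hardcoding it.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Sketch: dump takes the indent level as an optional parameter, so
// callers control nesting rather than the method hardcoding spaces.
struct State {
  std::vector<std::string> Items;
  void dump(std::ostream &OS, unsigned Indent = 0) const {
    for (const auto &I : Items)
      OS << std::string(Indent, ' ') << I << "\n";
  }
};
```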


nit: more spaces would make this more readable

{0} ->[{1}] {2}
or even
{0} --[{1}]-> {2}?


This really feels like a dubious optimization to me as it's *incorrect* if we ever get a hash collision, and the kernel ItemSet is much smaller than States which we're already storing. This seems like it must only save 20% or something of overall size?

In the other review thread you compared the size of DenseMap<hash_code, size_t> vs DenseMap<ItemSet, size_t>, but really the total size including States seems like a more useful comparison.

This revision is now accepted and ready to land.Feb 8 2022, 2:01 AM
hokein updated this revision to Diff 407092.Feb 9 2022, 1:58 AM
hokein marked 16 inline comments as done.

address review comments.


At the moment we build an LR(0) automaton, but the LRGraph itself is a generic data structure which could also model the LR(1) automaton (it is a matter of how we define the states/nodes in the graph). Say in the future we want to build a full LR(1); then we could easily add another method LRGraph::buildLR(1), so the current names seem more reasonable to me.


ok, changed to DenseMap<ItemSet, size_t> instead. The size of States is ~215KB, so it's 15% for DenseMap<hash_code> vs 28% for DenseMap<ItemSet>; the difference is not huge.

This revision was landed with ongoing or failed builds.Feb 9 2022, 2:20 AM
This revision was automatically updated to reflect the committed changes.
alextsao1999 added inline comments.

Can we add LookaheadSymbol here to implement LR(1)?

Herald added a project: Restricted Project. · View Herald TranscriptMar 5 2022, 2:21 AM
hokein added inline comments.Mar 7 2022, 1:10 AM

we could do that. However, we don't have a plan to implement LR(1) yet; we use SLR(1). (Though LR(1) is more powerful than SLR(1), a typical deterministic LR(1) parser still cannot handle the C++ grammar; we need a "general" parser, GLR, which can handle arbitrary context-free grammars.)

alextsao1999 added inline comments.Mar 7 2022, 5:48 AM

Thanks for your answer! Oh, I know some GLR parsers are based on LR(1) or LALR, so I thought our GLR parser was based on LR(1) as well. I'm trying to keep up with your train of thought :)

sammccall added inline comments.Mar 7 2022, 6:49 AM

Yeah, GLR changes the tradeoff between more sophisticated and simpler parsers (LR(1) > LALR > SLR(1) > LR(0)).

Normally the sophisticated parsers are able to handle grammars/languages that the simple ones can't, by avoiding action conflicts. So the value is very high.

However with GLR we can handle action conflicts by branching, so the value is "only" avoiding the performance hit of chasing branches that don't go anywhere.

So it didn't really seem worth the extra implementation complexity (or extra size of the in-memory grammar tables!) to use a more powerful parser than SLR(1). Maybe we should even have given LR(0) more thought :-)

alextsao1999 added inline comments.Mar 7 2022, 9:37 AM

Yes, agreed. SLR can reduce memory usage, but it can't handle operator precedence. With the help of GLR, we can resolve the problem of operator precedence by choosing one branch. Thanks, I got it!