This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contents

clang/include/clang/Tooling/Syntax/Pseudo/
    Grammar.h
    LRTable.h
clang/lib/Tooling/Syntax/Pseudo/
    CMakeLists.txt
    Grammar.cpp
    GrammarBNF.cpp
    LRTable.cpp
    LRTableBuild.cpp
clang/test/Syntax/
    check-cxx-bnf.test
    lr-build-basic.test
    lr-build-conflicts.test
clang/tools/clang-pseudo/
    ClangPseudo.cpp
clang/unittests/Tooling/Syntax/Pseudo/
    CMakeLists.txt
    LRGraphTest.cpp
    LRTableTest.cpp

Differential D118196

[syntax][pseudo] Implement LR parsing table.
ClosedPublic

Authored by hokein on Jan 25 2022, 2:46 PM.

Details

Reviewers

sammccall

Commits

rGa2fab82f33bb: [pseudo] Implement LRTable.

Summary

This patch introduces a dense implementation of the LR parsing table, which is
used by LR parsers. We implement an SLR(1) parsing table from an LR(0) automaton.

Statistics of the LR parsing table on the C++ spec grammar:

number of states: 1449
number of actions: 83069
number of index entries: 72612
size of the table (bytes): 334928

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

hokein created this revision. Jan 25 2022, 2:46 PM

Herald added a subscriber: mgorny. Jan 25 2022, 2:46 PM

hokein requested review of this revision. Jan 25 2022, 2:46 PM

Herald added a project: Restricted Project. Jan 25 2022, 2:46 PM

Harbormaster completed remote builds in B145607: Diff 403041. Jan 25 2022, 2:47 PM

Polish the LRTable implementation, using two flat vectors

memory-size (bytes): 1380908 => 664600

Harbormaster completed remote builds in B146573: Diff 404442. Jan 31 2022, 1:41 AM

hokein edited the summary of this revision. Jan 31 2022, 1:41 AM

A few initial comments just on the "easy parts" - changes to grammar and the first/follow set computation.

I need to study the rest. Generally I have a feeling some of the impl is very inefficient but efficiency may not matter much in this context, and it's unclear how much better it can be made without adding lots of complexity. Going to look more closely :-)

clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h
129	This function seems like a bit of an attractive nuisance: it could be a cheap accessor, but isn't. There are just two callers - both are calling it in a loop and both seem dubious to be iterating over all symbols at all. I wonder if we can avoid it entirely.
131	this "augmented symbol" name doesn't seem widely used. (search for `lr "augmented symbol"` produces few results, mostly unrelated). It's also a confusing name, because IIUC it's not the symbol that's augmented, but rather the grammar has been augmented with the symbol... I'd suggest just calling it the start symbol, unless the distinction is critical (I don't yet understand why it's important - I added it in the prototype just to avoid having the language-specific name "translation-unit" hardcoded). Otherwise it's hard to suggest a good name without understanding what it's for, but it should be descriptive...
134	In about the same number of words you could explain what this actually is :-) // Computes the set of terminals that could appear at the start of each non-terminal.
134	is there some reason these functions need to be part of the grammar class? They seem like part of the LR building to me, it could be local to LRBuilder.cpp, or exposed from LRTable.h or so for testing
clang/include/clang/Tooling/Syntax/Pseudo/LRAutomaton.h
66 ↗	(On Diff #404442)	why is std::set used here? (generally a data structure to avoid)
75 ↗	(On Diff #404442)	I have a feeling we've discussed this before but... I find this alias/abstraction unhelpful when reading the code, because operations on states are e.g. `state.insert` or `state.contains`, which don't match the abstraction. I'd probably prefer just using `ItemSet`, but alternatively `struct State { ItemSet Items; }` seems like it would work well: `State.Items.contains(...)` reads very naturally.
clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp
49	this function could use some comments.
A rule S := T ... implies elements in first(S):
- if T is a terminal, first(S) contains T
- if T is a nonterminal, first(S) contains first(T)
Since first(T) may not have been fully computed yet, first(S) itself may end up being incomplete. We iterate until a fixed point. (This isn't particularly efficient, but table building performance isn't critical).
61	`bool Changed = true; while(Changed) {`? or maybe `for (bool Changed = true; Changed;) {`? (do-while is rare enough that it always takes me a bit to grasp the control flow)
66	comment that we only need to consider the first element because symbols are non-nullable
98	nit: `... Y Z`, with consistent spaces/capitalization. Unless case is meant to imply something here?
105	copying the set here seems gratuitous, especially since we're doing this in a loop where !changed is the usual thing. I think inlining ExpandFollowSet for each case would make sense here. It turns into 1 line for the NT case plus two lines for the T case, so it's actually shorter overall
111	I think `if (isNonterminal(Z)) ...expand...` would be clearer here, since the condition is really part of rule 3 (but in a not-totally obvious way, it's nice to see them side-by-side)
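The fixed-point iteration described for first() above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: `Rule`, `FirstSets` and `computeFirstSets` are hypothetical names, symbols are plain strings, and every symbol is assumed to be non-nullable, so only the first element of each rule is inspected.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy types for illustration; not the Grammar API from the patch.
struct Rule {
  std::string Target;                // left-hand side nonterminal
  std::vector<std::string> Sequence; // right-hand side, assumed non-empty
};

using FirstSets = std::map<std::string, std::set<std::string>>;

// Computes FIRST sets by iterating to a fixed point.
// A symbol counts as a nonterminal iff it is the target of some rule.
FirstSets computeFirstSets(const std::vector<Rule> &Rules) {
  FirstSets First;
  for (const Rule &R : Rules)
    First[R.Target]; // create an empty set for every nonterminal

  for (bool Changed = true; Changed;) {
    Changed = false;
    for (const Rule &R : Rules) {
      assert(!R.Sequence.empty() && "symbols are assumed non-nullable");
      // Only the first symbol matters because nothing derives the empty string.
      const std::string &T = R.Sequence.front();
      std::set<std::string> &Out = First[R.Target];
      std::size_t Before = Out.size();
      auto It = First.find(T);
      if (It == First.end())
        Out.insert(T); // T is a terminal: first(S) contains T
      else if (T != R.Target)
        Out.insert(It->second.begin(), It->second.end()); // first(S) contains first(T)
      Changed |= Out.size() != Before;
    }
  }
  return First;
}
```

The `T != R.Target` guard skips left-recursive rules, which add nothing to first(S) and would otherwise insert a set into itself.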
clang/lib/Tooling/Syntax/Pseudo/LRBuilder.cpp
80 ↗	(On Diff #404442)	nit: please be consistent about spacing & capitalization for rules. I think trying to use a convention of uppercase/lowercase for term/nonterm is difficult because there's no clear way to indicate that a symbol could be either, which is common.
80 ↗	(On Diff #404442)	I can't really understand this description, it'd be helpful if it didn't reuse the name "advance" (particularly as a noun)
153 ↗	(On Diff #404442)	this loop looks like it's iterating over every possible symbol that might appear next, rather than just the ones that do (and creating a temporary array of 1000 items each outer iteration to do it). Seems like we could do better by instead advancing every rule, and spitting out a list of {symbol, ItemSet} pairs?
157 ↗	(On Diff #404442)	Hmm, and advance() is computing closures from scratch every time, probably just to find "oh, the state already existed". Seems like we could save work by using the kernel set to represent the full set until we actually need to iterate over its items. I'd like to think about this some more.

OK, a bunch of random notes but I'm getting a higher level picture.

I'm not sure about the split between LRAutomaton and LRTable. I suppose doing it in two stages simplifies the implementation, but I don't think the first stage particularly needs to be public.
Also I think the names are confusing: Automaton vs Table isn't a clear contrast as one describes behavior and one a data structure. And the "LRTable" seems more automaton-like to me!

I think my suggestion would be:

call the final structure LRAutomaton
call the LRAutomaton::build()::Builder structure LRGraph or something
I'm not sure the LRAutomaton structure needs to exist exactly, seems like the current LRTable::build() could operate on LRGraph directly

but I'm going to ponder this more.

clang/include/clang/Tooling/Syntax/Pseudo/LRAutomaton.h
32 ↗	(On Diff #404442)	accepting the grammar as a parameter rather than the rule length would let us ensure the length is correct locally rather than relying on the caller. Essentially all the callers need to do a lookup anyway.
45 ↗	(On Diff #404442)	nit: redundant `pseudo::` qualifier
47 ↗	(On Diff #404442)	nit: unsigned. just dot()? and similarly ruleID()->rule()? Not sure these would really be unclear
79 ↗	(On Diff #404442)	Hmm, isn't the point of GLR that it's a nondeterministic automaton? We've done the LR-style NFA->DFA conversion, but it's not complete (when we hit a conflict we continue rather than giving up)
clang/lib/Tooling/Syntax/Pseudo/LRBuilder.cpp
144 ↗	(On Diff #404442)	nit: too many braces - please spell out some of the types (and Auto should probably be State here?)
215 ↗	(On Diff #404442)	These comments are echoing the code - saying what, not why. "If we've just parsed the start symbol, we can accept the input".
220 ↗	(On Diff #404442)	Seems like we should maybe have the accepted symbol as a payload on the Accept action? Actually why do we need an Accept action, as opposed to just a reduce action and recognize it because it's the symbol we want and the stack is empty?
224 ↗	(On Diff #404442)	"If we've reached the end of a rule A := ..., then we can reduce if the next token is in the follow set of A"

OK, I think I understand it now! Sorry for so many comments, happy to chat offline about which ones to address/where to start.

clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h
40	API polish: this class is a mix of accessors and direct member state - it can be hard to know how to best interact with such a thing. I'd suggest:
- public factories `static Action Action::reduce(RuleID)` etc, private data & constructor
- expose `Kind kind()`, but not the individual is() methods: they don't buy much and this pattern encourages switch where appropriate
- make `Kind` enum rather than `enum class` - `Action::Shift` is sufficiently scoped already
- shift() & goTo() -> `targetState()` which asserts either condition. Apart from anything else, this makes naming easier: "shift" is a verb so `shift()` is a confusing name.
40	mention that this combines the Action and Go-to tables from LR literature?
42	I think it's worth documenting what each of these actions means. Referring to enum values as "action table" and "goto table" seems confusing - they're part of the same table! The main reference point when reading the code needs to be the code, rather than the textbook.
68	if you make the data members private the union could just be a uint16 and save yourself some trouble here :-) Up to you
85	This action struct can be squeezed to 16 bits. Kind only needs 3 bits (I suspect it can be 2: Error and Accept aren't heavily used). RuleID is 12 bits, StateID would fit within 11 and 12 should be safe. Combined with the optimization suggested below, this should get the whole table down to 335KB.
103	nit: please be consistent with `Nonterminal` (and "nonterminal") vs `NonTerminal` (and "non-terminal"). I think `Nonterminal` is used elsewhere in the code.
106	Why is this assertion valid? Is it a precondition of the function that some item in From can advance over NonTerminal?
122	I think there's an even more compact representation: Basically sort the index keys as symbol-major, but then don't store the actual symbol there but rather store the range for the symbol in an external array:
    // index is lookahead tok::TokenKind. Values are indices into Index: the range for tok is StateIdx[TokenIdx[tok]...TokenIdx[tok+1]]
    vector<unsigned> TokenIdx;
    // index is nonterminal SymbolID. Values are indices into Index: the range for sym is StateIdx[NontermIdx[sym]...NontermIdx[sym+1]].
    vector<unsigned> NontermIdx;
    // parallel to actions, value is the from state. Only subranges are sorted.
    vector<StateID> StateIdx;
    vector<Action> Actions;
This should be 4 * (392 + 245) + (2 + 4) * 83k ~= 500k. And the one extra indirection should save 5-10 steps of binary search.
This is equivalent to 2 * `vector<unsigned>` + `vector<pair<StateID, Action>>` but keeping the searched index compact is probably a good thing.
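A lookup over the symbol-major layout suggested above might look roughly like this. It is a sketch under assumptions: `CompactTable`, `SymbolOffset` and `find` are hypothetical names, terminals and nonterminals share one offset array for brevity, and `Action` is just an opaque 16-bit value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using StateID = std::uint16_t;
using SymbolID = std::uint16_t;
using Action = std::uint16_t; // opaque payload in this sketch

// Hypothetical layout, not the LRTable class from the patch.
struct CompactTable {
  // Indexed by symbol. Entries for symbol S live in the half-open range
  // [SymbolOffset[S], SymbolOffset[S + 1]) of States/Actions.
  std::vector<std::uint32_t> SymbolOffset;
  // Parallel to Actions: the source state. Sorted within each symbol's range.
  std::vector<StateID> States;
  std::vector<Action> Actions;

  // Returns all actions for (From, Sym): one array lookup to narrow the
  // range, then a binary search over states only.
  std::vector<Action> find(StateID From, SymbolID Sym) const {
    assert(static_cast<std::size_t>(Sym) + 1 < SymbolOffset.size() &&
           "unknown symbol");
    auto Begin = States.begin() + SymbolOffset[Sym];
    auto End = States.begin() + SymbolOffset[Sym + 1];
    auto Match = std::equal_range(Begin, End, From);
    return std::vector<Action>(
        Actions.begin() + (Match.first - States.begin()),
        Actions.begin() + (Match.second - States.begin()));
  }
};
```

Returning a copy keeps the sketch simple; a real table would more likely hand back a non-owning view of the matching slice.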
clang/lib/Tooling/Syntax/Pseudo/LRBuilder.cpp
74 ↗	(On Diff #404442)	As far as I can tell, this is the only place where we need to deduplicate/check for membership in the set. WDYT about using a local DenseSet<Item> in `closure()` itself for this, and making the representation of ItemSet a (small)vector? (IIUC the reason for std::set over DenseMap is determinism. Using vector should preserve that: if you really want sorting you can sort after computing closure, but if we just want a well-defined order to ensure stable state numbers then insertion order should work fine)
109 ↗	(On Diff #404442)	using hashcode instead of the key itself should be explained. (If it's just an issue with hash of std::set, I think using std::vector fixes it)
157 ↗	(On Diff #404442)	Yeah, I think this should work. Builder would maintain a map from (sorted kernel items) => stateID, and Builder.insert() would call closure() only if actually inserting.
clang/tools/clang-pseudo/ClangPseudo.cpp
52	I think that we should also have flags to:
- dump the grammar
- dump the LR table

hokein mentioned this in D118990: [pseudo] Add first and follow set computation in Grammar. Feb 4 2022, 5:58 AM

hokein marked 7 inline comments as done. Feb 4 2022, 5:59 AM

hokein added inline comments.

clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h
131	yeah, it is about the augmented grammar. Strictly speaking, it is the start symbol of the augmented grammar. Maybe we should not talk about augmented stuff in the code; it is not a well-known term and causes confusion. Renamed it to start symbol instead. It looks like using an augmented grammar is a "standard" LR technique: it introduces a converged start state and accepting state in the LR graph, which makes the implementation easier. And it is a good fit for supporting multiple start symbols.
134	as discussed, made them free functions in Grammar.h. Moved the first and follow set changes to a separate patch, https://reviews.llvm.org/D118990; comments around them should be addressed there.
clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp
98	Y was for a nonterminal while z was for a nonterminal or a terminal. The distinction is not very important, so I changed them all to uppercase.

hokein mentioned this in D119172: [pseudo] Implement LRGraph. Feb 7 2022, 11:36 AM

comments around LRAutomaton should be addressed in the separate patch, https://reviews.llvm.org/D119172.

clang/include/clang/Tooling/Syntax/Pseudo/LRAutomaton.h
32 ↗	(On Diff #404442)	right, but we still need this one, for constructing empty/tombstone items for DenseMapInfo.
66 ↗	(On Diff #404442)	the main reason is that we want a deterministic order for computing the hash value. Changed to a sorted vector.
79 ↗	(On Diff #404442)	yes and no. There are a few conceptual differences here.
GLR is indeed a nondeterministic parser, but I think the term deterministic/nondeterministic for parsers means something different from the one used in the automaton context:
- A deterministic parser (typical LR) has the property that there is always only one possible action to choose from, thus it is fast (linear), and it produces only one syntax tree;
- A deterministic automaton is described here.
"We've done the LR-style NFA->DFA conversion, but it's not complete (when we hit a conflict we continue rather than giving up)"
Not exactly, there is no NFA->DFA conversion here. What this patch does:
- we construct a deterministic LR(0) automaton (DFA) directly, this is what LRAutomaton represents;
- based on the DFA, we construct an LR table (called Action and GoTo Table in standard LR). In particular, we use the FollowSet to determine reduce actions; this is called an SLR(1) table, which is more powerful than the LR(0) table.
A conflict in the LRTable doesn't imply an incomplete DFA. For an arbitrary grammar (no matter whether it is ambiguous or not), we can always construct a DFA. For example, based on the same LR(0) automaton, we could build an SLR(1) table (without conflicts) or an LR(0) table (with conflicts). The real problem lies in the interpretation of the states of the automaton. Consider a node in the automaton with state 0
    E := T. + E
    E := T.
where the node has a "+" out edge. If the next symbol is `+`, we have two options: "reduce E := T" or shift `+`. SLR can tell that the reduction is not available because `+` is not in FOLLOW(E) (thus no conflict), while LR(0) will accept both (thus a conflict).
I added some bits in the file comment of LRGraph.h, hope it can clarify these concepts.
clang/lib/Tooling/Syntax/Pseudo/LRBuilder.cpp
109 ↗	(On Diff #404442)	using the ItemSet as key seems heavy (for cxx.grammar, 65KB for `llvm::DenseMap<itemset, ..>` vs 32KB for `llvm::DenseMap<hash_code>`) and unnecessary (we don't need to access the Itemset). Added comments.
144 ↗	(On Diff #404442)	"too many braces - please spell out some of the types": this is one of the cons of using inlay-hints :(

hokein mentioned this in rGfe932a88e970: [pseudo] Add first and follow set computation in Grammar. Feb 9 2022, 12:16 AM

hokein mentioned this in rGf1984b143367: [pseudo] Implement LRGraph. Feb 9 2022, 2:20 AM

rebase, rescope the patch to LRTable
refine the Action class interfaces, and naming;
use a more compact Index, reduce LRTable from 660KB => 335KB;
address review comments;

Herald added a subscriber: mgrang. Feb 11 2022, 6:25 AM

hokein added inline comments. Feb 11 2022, 6:27 AM

clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h
85	squeezed to 16 bits (3 bits for Kind, and 13 bits for the Value).
106	This is guaranteed by a DFA (a node cannot have two out edges with the same label), the same reason why we don't have shift/shift conflicts.
122	This looks like a nice improvement, though the implementation is a little tricky -- we reduced the binary-search space from 2 dimensions (State, Symbol) to 1 dimension (Symbol).
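The 16-bit packing mentioned above (3 bits for the kind, 13 bits for the value) could be laid out along these lines. This is only a sketch: `PackedAction` is a hypothetical name, the enumerator order is an assumption, and a single `uint16_t` with explicit shifts is used here instead of bit-fields so the size does not depend on the compiler's bit-field layout.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned KindBits = 3;
constexpr unsigned ValueBits = 13;
static_assert(KindBits + ValueBits == 16, "must fill exactly 16 bits");

// Hypothetical sketch, not the Action class from the patch.
class PackedAction {
public:
  enum Kind : std::uint8_t { Sentinel = 0, Shift, Reduce, Accept, GoTo };

  static PackedAction shift(std::uint16_t State) { return PackedAction(Shift, State); }
  static PackedAction goTo(std::uint16_t State) { return PackedAction(GoTo, State); }
  static PackedAction reduce(std::uint16_t Rule) { return PackedAction(Reduce, Rule); }
  static PackedAction accept() { return PackedAction(Accept, 0); }

  Kind kind() const { return static_cast<Kind>(Bits >> ValueBits); }
  std::uint16_t targetState() const {
    assert(kind() == Shift || kind() == GoTo);
    return payload();
  }
  std::uint16_t rule() const {
    assert(kind() == Reduce);
    return payload();
  }
  // Raw encoding, e.g. for hashing or equality.
  std::uint16_t opaque() const { return Bits; }

private:
  PackedAction(Kind K, std::uint16_t Payload)
      : Bits(static_cast<std::uint16_t>((K << ValueBits) | Payload)) {
    assert(Payload < (1u << ValueBits) && "payload does not fit in ValueBits");
  }
  std::uint16_t payload() const {
    return static_cast<std::uint16_t>(Bits & ((1u << ValueBits) - 1));
  }
  std::uint16_t Bits;
};
static_assert(sizeof(PackedAction) == 2, "action should occupy 16 bits");
```

With 13 value bits the payload can address up to 8192 states or rules, which leaves headroom over the 1449 states reported in the summary.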

hokein edited the summary of this revision. Feb 11 2022, 6:27 AM

Harbormaster completed remote builds in B148973: Diff 407862. Feb 11 2022, 7:03 AM

This is really nice, mostly nits.

clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h
179	why is this exposed/required rather than being initialized by the GrammarTable constructor? Since this is essentially static (must always correspond to tok::TokenKind) it seems that GrammarTable could just have an ArrayRef and it could be initialized by a lazy-init singleton:
    // in Grammar.cpp
    static ArrayRef<std::string> getTerminalNames() {
      static std::vector<std::string> TerminalNames = []{ };
      return TerminalNames;
    }
(I know eventually we'd like GrammarTable to be generated and have very minimal dynamic init, but there are lots of other details needed e.g. we can't statically initialize `vector<Rule>` either, so we should cross that bridge when we come to it)
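The lazy-init singleton idea above can be sketched in standalone C++ roughly as follows. The `TokenKind` enum and the names stored in it are hypothetical stand-ins for `tok::TokenKind`, and the function returns a `const std::vector` reference instead of an `ArrayRef` so the example needs no LLVM headers. Note that the lambda has to be invoked, hence the trailing `()`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for tok::TokenKind, for illustration only.
enum TokenKind { TK_eof, TK_identifier, TK_comma, TK_semi, NumTokens };

// The function-local static is initialized once, on first call, by running
// the lambda. Since C++11 this initialization is thread-safe.
const std::vector<std::string> &getTerminalNames() {
  static const std::vector<std::string> TerminalNames = [] {
    std::vector<std::string> Names(NumTokens);
    Names[TK_eof] = "EOF";
    Names[TK_identifier] = "IDENTIFIER";
    Names[TK_comma] = ",";
    Names[TK_semi] = ";";
    return Names;
  }();
  return TerminalNames;
}
```

Every caller sees the same vector, so a table object can hold a non-owning view of it without worrying about lifetime.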
clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h
29	space -> sparse
59	Maybe mention relation to GLR: GLR can execute these in parallel?
68	nit: I think you want this to be a comment on the section rather than on Error, leave a blank line?
69	Error seems like an action we'd dynamically handle, rather than an unreachable sentinel. I'd prefer to call this sentinel and llvm_unreachable() on it in the relevant places. Even if we do plan dynamic error actions, we have enough spare bits to keep these two cases separate.
81	again blank line between "nonterminal actions" and "go to state n"
81	pust -> push? I thought goto replaces the top of the stack rather than being pushed onto it. e.g.
1. stack is { stmt := . expr ; }, lookahead is IDENT, action is shift "expr := IDENT ."
2. stack is { stmt := . expr ; | expr := IDENT . }, lookahead is semi, action is reduce
3. stack is { stmt := . expr ; }, reduced symbol is expr, goto is state "stmt := expr . ;"
4. stack is { stmt := expr . ;}, lookahead is semi...
Line 3=>4 doesn't seem like a push
107	maybe value()=>opaque() or asInteger()? Partly to give a hint a bit more at the purpose, partly to avoid confusion of `Value != value()`
115	static assert RuleBits <= ValueBits, and I think we want a StateBits above plus some assertions on the number of states when building
115	FWIW I've mostly seen just `unsigned` as the field type for bitfields. It seems a little confusing to explicitly specify the size both as 16 and 13 bits. Up to you, but if you change this one also consider the Kind enum.
137	This is quite a lot of surface (together with the DenseMapInfo stuff) to expose. Currently it looks like it needs to be in the header so that the LRTableTest can construct a table directly rather than going through LRGraph. I think you could expose a narrower interface:
    struct Entry { ;... };
    // Specify exact table contents for testing.
    static LRTable buildForTests(ArrayRef<Entry>);
Then the builder/densemapinfo can be moved to the cpp file, and both `buildForTests` and `buildSLR` can use the builder internally. WDYT?
155	maybe "value is the offset into States/Actions where the entries for this symbol begin".
164	Reading the building/lookup code, "index" and "idx" are used so often they're hard to follow. I think calling this "States", parallel to "Actions", would be clearer. Similarly maybe the NontermIdx -> NontermOffset (Offset into States/Actions arrays where we find entries for this nonterminal). Then we can just use the word "index" for actual subscripts.
166	concepetually -> conceptually. I think mixing this comment in with the concrete data is confusing. Maybe lift it to the top of the private section like:
    // Conceptually this is a multimap from (State, SymbolID) => action.
    // In the literature this is often a table (and multiple entries, i.e. conflicts are forbidden).
    // Our physical representation is quite different for compactness.
clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp
26	as things currently stand this should be unreachable - these values should never escape the DenseMap. assert or llvm_unreachable?
98	(moving the comment thread to the new code location) The DFA input guarantees Result.size() <= 1, but why can't it be zero? If this is a requirement on the caller, mention it in the header?
163	nit: "assign" seems clearer than "resize"
166	We end up with holes because we're looping over the values rather than the keys. This also feels quite indirect. Seems clear enough to reverse this, something like:
    unsigned Pos = 0;
    for (unsigned NT = 0; NT < GT.Nonterminals.size(); ++NT) {
      NontermIdx[NT] = Pos;
      while (Pos < Sorted.size() && Sorted[Pos].Action == NT)
        ++Pos;
    }
    NontermIdx.back() = Pos;
    // and the same for terminals
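As a standalone illustration of the loop suggested above, the offsets can be built by walking the keys (symbols) once over entries that are already sorted by symbol. `buildOffsets` and its parameters are hypothetical names; the input is reduced to just the symbol of each sorted entry.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// SortedSymbols[I] is the symbol of the I-th table entry; entries must be
// sorted by symbol and every symbol must be < NumSymbols.
// Returns NumSymbols + 1 offsets: entries for symbol S occupy the half-open
// range [Offsets[S], Offsets[S + 1]).
std::vector<unsigned> buildOffsets(const std::vector<unsigned> &SortedSymbols,
                                   unsigned NumSymbols) {
  assert(std::is_sorted(SortedSymbols.begin(), SortedSymbols.end()));
  std::vector<unsigned> Offsets(NumSymbols + 1);
  unsigned Pos = 0;
  for (unsigned Sym = 0; Sym < NumSymbols; ++Sym) {
    Offsets[Sym] = Pos; // symbols without entries get an empty range
    while (Pos < SortedSymbols.size() && SortedSymbols[Pos] == Sym)
      ++Pos;
  }
  Offsets.back() = Pos;
  assert(Pos == SortedSymbols.size() && "entry with out-of-range symbol");
  return Offsets;
}
```

Because the loop iterates over symbols rather than entries, a symbol with no entries simply gets two equal adjacent offsets and no holes are left behind.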
clang/lib/Tooling/Syntax/Pseudo/LRTableBuild.cpp
20	(I think it would be worth moving this into LRTable in order to hide the Builder type from the header...)
32	(rephrasing a comment that got lost earlier in the review) I'm not totally clear on the scope/purpose of the accept action:
- if it's meant to be sufficient to know whether the parse succeeded, I think it needs the symbol we accepted as a payload. The grammar will accept them all, but the caller will place restrictions.
- if we're OK inspecting the stack and seeing our last reduction was a Decl or so, why doesn't the parser just do that at the end of the stream and do away with `Accept` altogether? (This would avoid the need to splice `eof` tokens into token streams to parse subranges of them).
clang/unittests/Tooling/Syntax/Pseudo/LRGraphTest.cpp
0	Seems sensible to combine the table-building tests in here, but it does make the organization of the tests hard. (Particularly since LRTableTests.cpp exists but the most important tests are in this file instead). Since these cross module boundaries, and they are exactly "bundle of text in, bundle of text out"... how do you feel about making them lit tests instead?
    # RUN: clang-pseudo -grammar %s -print-graph | FileCheck --check-prefix=GRAPH
    # RUN: clang-pseudo -grammar %s -print-table | FileCheck --check-prefix=TABLE
    _ := expr
    ...
    # GRAPH: States:
    # GRAPH-NEXT: ...
    # TABLE: LRTable:
    # TABLE-NEXT: ...

address comments
more API polish
LRGraph unittest => lit tests
hide LRTable::Builder, move to LRTableBuild.cpp

update

clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h
179	fair enough, let's hide it in a GrammarTable constructor.
clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h
69	changed to Sentinel, and removed the Error action for now.
81	yeah, this is mostly right -- stack in step 2 has two frames, rather than a combined one. The truth is:
1. stack is [{ _ := . stmt | stmt := . expr ; }], lookahead is IDENT, action is shift "expr := IDENT ." (push a new state to the stack)
2. stack is [{ _ := . stmt | stmt := . expr ; }, { expr := IDENT . }], lookahead is semi, action is reduce
3. stack is [{ _ := . stmt | stmt := . expr ; }], reduced symbol is expr, goto is state "stmt := expr . ;"
4. stack is [{ _ := . stmt | stmt := . expr ;}, { stmt := expr . ;}], lookahead is semi, action is shift "stmt := expr ; ."
5. stack is [{ _ := . stmt | stmt := . expr ;}, { stmt := expr . ;}, { stmt := expr ; .}], lookahead is eof, action is reduce "stmt := expr ;"
6. stack is [{ _ := . stmt | stmt := . expr ;}], reduced symbol is stmt, goto state is "_ := stmt ."
7. stack is [{ _ := . stmt | stmt := . expr ;}, { _ := stmt .}], lookahead is eof, action is accept, the parsing is done
Step 3 => 4 is a push.
107	renamed to `opaque`.
137	Yeah, I exposed the Builder mainly for testing purposes. The new interface looks good to me.
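The shift/reduce/go-to trace for the `stmt := expr ;` example above can be replayed with a tiny deterministic LR driver. Everything here is an assumption made for illustration: the state numbers (0 = start, 1 = `expr := IDENT .`, 2 = `stmt := expr . ;`, 3 = `stmt := expr ; .`, 4 = `_ := stmt .`), the hand-written table, and the names `Act`, `RuleInfo` and `parse`. It is not the GLR parser or the LRTable API.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Act {
  enum Kind { Shift, Reduce, Accept } K;
  int Arg; // target state for Shift, rule index for Reduce
};
struct RuleInfo {
  std::string Target;
  std::size_t Length; // number of symbols on the right-hand side
};

// Parses Tokens with a hand-written SLR(1) table for the grammar
//   0: _ := stmt    1: stmt := expr ;    2: expr := IDENT
// and records each step in Trace. Returns false on a parse error.
bool parse(std::vector<std::string> Tokens, std::vector<std::string> &Trace) {
  static const std::vector<RuleInfo> Rules = {{"_", 1}, {"stmt", 2}, {"expr", 1}};
  static const std::map<std::pair<int, std::string>, Act> Actions = {
      {{0, "IDENT"}, {Act::Shift, 1}},
      {{1, ";"}, {Act::Reduce, 2}},
      {{2, ";"}, {Act::Shift, 3}},
      {{3, "eof"}, {Act::Reduce, 1}},
      {{4, "eof"}, {Act::Accept, 0}},
  };
  static const std::map<std::pair<int, std::string>, int> GoTo = {
      {{0, "expr"}, 2}, {{0, "stmt"}, 4}};

  Tokens.push_back("eof");
  std::vector<int> Stack = {0};
  std::size_t Next = 0;
  while (true) {
    assert(Next < Tokens.size());
    auto It = Actions.find({Stack.back(), Tokens[Next]});
    if (It == Actions.end())
      return false; // no action for (state, lookahead)
    const Act &A = It->second;
    switch (A.K) {
    case Act::Shift: // push the target state and consume the token
      Stack.push_back(A.Arg);
      ++Next;
      Trace.push_back("shift " + std::to_string(A.Arg));
      break;
    case Act::Reduce: { // pop one state per RHS symbol, then push the go-to state
      const RuleInfo &R = Rules[A.Arg];
      Stack.resize(Stack.size() - R.Length);
      auto G = GoTo.find({Stack.back(), R.Target});
      assert(G != GoTo.end() && "table is missing a go-to entry");
      Stack.push_back(G->second);
      Trace.push_back("reduce " + R.Target + ", go to " + std::to_string(G->second));
      break;
    }
    case Act::Accept:
      Trace.push_back("accept");
      return true;
    }
  }
}
```

For the input `IDENT ;` the recorded trace has five entries, matching the shift, reduce plus go-to, shift, reduce plus go-to, accept sequence in the steps above; the go-to after each reduce is a push onto the state stack.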
clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp
98	yeah, getGoToState and getActions are expected to be called by the GLR parser, the parser itself should guarantee it during parsing. Added comments.
166	yeah, this is much better!
clang/lib/Tooling/Syntax/Pseudo/LRTableBuild.cpp
20	I tend to move all build bits (Builder, BuildSLRTable, BuildFor tests) to this file rather than `LRTable.cpp`.
32	The accept action is following what is described in the standard LR literature. Yeah, the main purpose of the accept action is to tell us whether a parse succeeded. We could add a ruleID as a payload for the accept action, but I'm not sure whether it would be useful: the associated rule is about `_` (e.g. `_ := translation_unit`) -- we are interested in `translation_unit`, and this can be retrieved from the forest.
I think both options are mostly equivalent (the 1st option is a standard LR implementation, the 2nd one seems more ad hoc) -- we can treat the accept action as a "special" reduce action of the rule "_ := translation_unit", except that we don't do the reduce, we just stop parsing.
I think we will probably revisit this later -- things become trickier if we start supporting snippets (the start symbol `_` will have multiple rules), e.g. for the following grammar,
    _ := stmt
    _ := expr
    stmt := expr
    expr := ID
the input "ID" can be parsed as a "stmt" or an "expr". If we use a unified state `{ _ := . stmt | _ := . expr }` to start, we will have a new type of conflict (accept/reduce). I don't think we want to handle this kind of new conflict. There are some options:
- carefully manage the `_` rules, to make sure no such conflicts happen;
- introduce a new augmented grammar `__ := _` (the accept/reduce conflict would become a reduce/reduce conflict);
- use a separate start state per `_` rule, and callers need to pass a target nonterminal to the parser;
Since there are some uncertain bits here, I'd prefer not making further changes in this patch (but happy to add a ruleID payload for the accept action). WDYT?
clang/unittests/Tooling/Syntax/Pseudo/LRGraphTest.cpp
0	I don't have a strong opinion about these, both of them work. I think having options in the `clang-pseudo` tool to dump the graph/table is a helpful feature for debugging as well, so using lit tests might make more sense.

Harbormaster completed remote builds in B150399: Diff 409893. Feb 18 2022, 4:08 AM

Thanks, I really like the way the tests look now!

clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h
79	// Pops
81	parsd -> parsed
clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp
29	nit: goTo -> go to, like dumpForTests?

This revision is now accepted and ready to land. Feb 21 2022, 1:24 AM

address comments

This revision was landed with ongoing or failed builds. Feb 23 2022, 12:21 AM

Closed by commit rGa2fab82f33bb: [pseudo] Implement LRTable. (authored by hokein).

This revision was automatically updated to reflect the committed changes.

hokein added a commit: rGa2fab82f33bb: [pseudo] Implement LRTable.

Harbormaster completed remote builds in B151004: Diff 410725. Feb 23 2022, 12:47 AM

Hi, this commit is causing runtime failures on Windows in debug builds. Can you please correct or revert? Thanks!

clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp
119	This is causing an assertion with debug builds on Windows because `Actions[End]` is out of bounds (so the MSVC STL debug assertions catch the issue) for the test cases in this patch.

In D118196#3341110, @aaron.ballman wrote:

Hi, this commit is causing runtime failures on Windows in debug builds. Can you please correct or revert? Thanks!

sorry for the trouble. I'm mostly running out of time today, I will revert this patch, and fix it tomorrow.

In D118196#3341159, @hokein wrote:

In D118196#3341110, @aaron.ballman wrote:

Hi, this commit is causing runtime failures on Windows in debug builds. Can you please correct or revert? Thanks!

sorry for the trouble. I'm mostly running out of time today, I will revert this patch, and fix it tomorrow.

Thanks!

hokein added inline comments. Feb 23 2022, 12:36 PM

clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp
119	this should be `llvm::makeArrayRef(&Actions[Start], End - Start)`. Fixed in 302ca279cb83043ef7d60115eb5ba58f12064a4a.
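The out-of-bounds issue discussed above is easy to reproduce with a plain `std::vector`: forming `&Actions[End]` evaluates `operator[]` with an index equal to `size()`, which checked standard libraries reject even though the element is never read. A pointer-plus-length slice avoids touching `Actions[End]` at all. The `slice` helper below is a hypothetical illustration using `std::pair` instead of `llvm::ArrayRef`.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Returns (pointer, length) describing Actions[Start, End) without ever
// evaluating Actions[End].
std::pair<const int *, std::size_t> slice(const std::vector<int> &Actions,
                                          std::size_t Start, std::size_t End) {
  assert(Start <= End && End <= Actions.size() && "range out of bounds");
  // data() + Start is a valid pointer for any Start <= size(), including the
  // one-past-the-end position, whereas Actions[size()] is not a valid access.
  return {Actions.data() + Start, End - Start};
}
```

Using `data() + Start` also covers the case where an empty range sits at the very end of the vector, which `&Actions[Start]` would not.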

aaron.ballman added inline comments. Feb 23 2022, 12:45 PM

clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp
119	I can confirm that the issue is now fixed for me, thank you!

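Revision Contents

Path                                                      Size
clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h       4 lines
clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h       182 lines
clang/lib/Tooling/Syntax/Pseudo/CMakeLists.txt            4 lines
clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp               17 lines
clang/lib/Tooling/Syntax/Pseudo/GrammarBNF.cpp            12 lines
clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp               124 lines
clang/lib/Tooling/Syntax/Pseudo/LRTableBuild.cpp          143 lines
clang/test/Syntax/check-cxx-bnf.test                      2 lines
clang/test/Syntax/lr-build-basic.test                     24 lines
clang/test/Syntax/lr-build-conflicts.test                 47 lines
clang/tools/clang-pseudo/ClangPseudo.cpp                  43 lines
clang/unittests/Tooling/Syntax/Pseudo/CMakeLists.txt      2 lines
clang/unittests/Tooling/Syntax/Pseudo/LRGraphTest.cpp
clang/unittests/Tooling/Syntax/Pseudo/LRTableTest.cpp     56 lines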

Diff 410729

clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h

Show First 20 Lines • Show All 120 Lines • ▼ Show 20 Lines	public:
// Returns the SymbolID of the start symbol '_'.		// Returns the SymbolID of the start symbol '_'.
SymbolID startSymbol() const { return StartSymbol; };		SymbolID startSymbol() const { return StartSymbol; };

// Returns all rules of the given non-terminal symbol.		// Returns all rules of the given non-terminal symbol.
llvm::ArrayRef<Rule> rulesFor(SymbolID SID) const;		llvm::ArrayRef<Rule> rulesFor(SymbolID SID) const;
const Rule &lookupRule(RuleID RID) const;		const Rule &lookupRule(RuleID RID) const;

// Gets symbol (terminal or non-terminal) name.		// Gets symbol (terminal or non-terminal) name.
// Terminals have names like "," (kw_comma) or "OPERATOR" (kw_operator).		// Terminals have names like "," (kw_comma) or "OPERATOR" (kw_operator).
		sammccallUnsubmitted Not Done Reply Inline Actions This function seems like a bit of an attractive nuisance: it could be a cheap accessor, but isn't. There are just two callers - both are calling it in a loop and both seem dubious to be iterating over all symbols at all. I wonder if we can avoid it entirely. sammccall: This function seems like a bit of an attractive nuisance: it could be a cheap accessor, but…
llvm::StringRef symbolName(SymbolID) const;		llvm::StringRef symbolName(SymbolID) const;

		sammccallUnsubmitted Not Done Reply Inline Actions this "augmented symbol" name doesn't seem widely used. (search for `lr "augmented symbol"` produces few results, mostly unrelated). It's also a confusing name, because it's IIUC it's not the symbol that's augmented, but rather the grammar has been augmented with the symbol... I'd suggest just calling it the start symbol, unless the distinction is critical (I don't yet understand why it's important - I added it in the prototype just to avoid having the language-specific name "translation-unit" hardcoded). Otherwise it's hard to suggest a good name without understanding what it's for, but it should be descriptive... sammccall: this "augmented symbol" name doesn't seem widely used. (search for `lr "augmented symbol"`…
		hokeinAuthorUnsubmitted Done Reply Inline Actions yeah, it is about augmented grammar. Strictly speaking, it is a start symbol of augmented symbol. Maybe we should not speak augmented stuff in the code, it is not a well-known term, and causes confusion. rename it to start symbol instead. It looks like using augmented grammar is a "standard" LR technique, it introduces a converged start state and accepting state in the LR graph, which makes it easier for the implementation. And it is a good fit for support multiple start symbols. hokein: yeah, it is about augmented grammar. Strictly speaking, it is a start symbol of augmented…
// Dumps the whole grammar.		// Dumps the whole grammar.
std::string dump() const;		std::string dump() const;
// Dumps a particular rule.		// Dumps a particular rule.
sammccall (Done): In about the same number of words you could explain what this actually is :-)
    // Computes the set of terminals that could appear at the start of each non-terminal.
sammccall (Done): Is there some reason these functions need to be part of the grammar class? They seem like part of the LR building to me; they could be local to LRBuilder.cpp, or exposed from LRTable.h or so for testing.
hokein (Author, Done): As discussed, made them free functions in Grammar.h. Separated the first and follow set changes into https://reviews.llvm.org/D118990; comments around them should be addressed there.
std::string dumpRule(RuleID) const;		std::string dumpRule(RuleID) const;
// Dumps all rules of the given nonterminal symbol.		// Dumps all rules of the given nonterminal symbol.
std::string dumpRules(SymbolID) const;		std::string dumpRules(SymbolID) const;

const GrammarTable &table() const { return *T; }		const GrammarTable &table() const { return *T; }

private:		private:
std::unique_ptr<GrammarTable> T;		std::unique_ptr<GrammarTable> T;
// The start symbol '_' of the augmented grammar.		// The start symbol '_' of the augmented grammar.
SymbolID StartSymbol;		SymbolID StartSymbol;
};		};
// For each nonterminal X, computes the set of terminals that begin strings		// For each nonterminal X, computes the set of terminals that begin strings
// derived from X. (Known as FIRST sets in grammar-based parsers).		// derived from X. (Known as FIRST sets in grammar-based parsers).
std::vector<llvm::DenseSet<SymbolID>> firstSets(const Grammar &);		std::vector<llvm::DenseSet<SymbolID>> firstSets(const Grammar &);
// For each nonterminal X, computes the set of terminals that could immediately		// For each nonterminal X, computes the set of terminals that could immediately
// follow X. (Known as FOLLOW sets in grammar-based parsers).		// follow X. (Known as FOLLOW sets in grammar-based parsers).
std::vector<llvm::DenseSet<SymbolID>> followSets(const Grammar &);		std::vector<llvm::DenseSet<SymbolID>> followSets(const Grammar &);

// Storage for the underlying data of the Grammar.		// Storage for the underlying data of the Grammar.
// It can be constructed dynamically (from compiling BNF file) or statically		// It can be constructed dynamically (from compiling BNF file) or statically
// (a compiled data-source).		// (a compiled data-source).
struct GrammarTable {		struct GrammarTable {
		GrammarTable();

struct Nonterminal {		struct Nonterminal {
std::string Name;		std::string Name;
// Corresponding rules that construct the non-terminal, it is a [start, end)		// Corresponding rules that construct the non-terminal, it is a [start, end)
// index range of the Rules table.		// index range of the Rules table.
struct {		struct {
RuleID start;		RuleID start;
RuleID end;		RuleID end;
} RuleRange;		} RuleRange;
};		};

// The rules are sorted (and thus grouped) by target symbol.		// The rules are sorted (and thus grouped) by target symbol.
// RuleID is the index of the vector.		// RuleID is the index of the vector.
std::vector<Rule> Rules;		std::vector<Rule> Rules;
// A table of terminals (aka tokens). It corresponds to the clang::Token.		// A table of terminals (aka tokens). It corresponds to the clang::Token.
// clang::tok::TokenKind is the index of the table.		// clang::tok::TokenKind is the index of the table.
std::vector<std::string> Terminals;		llvm::ArrayRef<std::string> Terminals;
// A table of nonterminals, sorted by name.		// A table of nonterminals, sorted by name.
// SymbolID is the index of the table.		// SymbolID is the index of the table.
std::vector<Nonterminal> Nonterminals;		std::vector<Nonterminal> Nonterminals;
};		};

sammccall (Done): Why is this exposed/required rather than being initialized by the GrammarTable constructor? Since this is essentially static (must always correspond to tok::TokenKind) it seems that GrammarTable could just have an ArrayRef and it could be initialized by a lazy-init singleton:
    // in Grammar.cpp
    static ArrayRef<std::string> getTerminalNames() {
      static std::vector<std::string> TerminalNames = []{ };
      return TerminalNames;
    }
(I know eventually we'd like GrammarTable to be generated and have very minimal dynamic init, but there are lots of other details needed, e.g. we can't statically initialize `vector<Rule>` either, so we should cross that bridge when we come to it.)
hokein (Author, Done): Fair enough, let's hide it in the GrammarTable constructor.
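The lazy-init singleton suggested in this thread can be sketched without any LLVM or clang dependencies. This is only an illustration under stated assumptions: `RawTokenNames`, `terminalNames`, and `ToyGrammarTable` are hypothetical names, and the three token names stand in for what the patch derives from `tok::TokenKind`.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-in for the names clang derives from tok::TokenKind.
static const char *const RawTokenNames[] = {"identifier", "operator", "eof"};

// Built once on first use; initialization of a function-local static is
// thread-safe since C++11.
const std::vector<std::string> &terminalNames() {
  static const std::vector<std::string> Names = [] {
    std::vector<std::string> Result;
    for (const char *Raw : RawTokenNames) {
      std::string Upper(Raw);
      for (char &C : Upper)
        C = static_cast<char>(std::toupper(static_cast<unsigned char>(C)));
      Result.push_back(Upper);
    }
    return Result;
  }();
  return Names;
}

// A table type can then hold a cheap view of the shared data instead of
// owning (and re-initializing) its own copy.
struct ToyGrammarTable {
  ToyGrammarTable() : Terminals(&terminalNames()) {}
  const std::vector<std::string> *Terminals;
};
```

Every `ToyGrammarTable` points at the same vector, which mirrors the idea of `GrammarTable` holding an `ArrayRef` over static data.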
} // namespace pseudo		} // namespace pseudo
} // namespace syntax		} // namespace syntax
} // namespace clang		} // namespace clang

#endif // LLVM_CLANG_TOOLING_SYNTAX_GRAMMAR_H		#endif // LLVM_CLANG_TOOLING_SYNTAX_GRAMMAR_H

clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h

This file was added.

				//===--- LRTable.h - Define LR Parsing Table ---------------------*- C++-*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// The LRTable (referred to as LR parsing table in the LR literature) is the core
				// component in LR parsers, it drives the LR parsers by specifying an action to
				// take given the current state on the top of the stack and the current
				// lookahead token.
				//
				// The LRTable can be described as a matrix where the rows represent
				// the states of the LR graph, the columns represent the symbols of the
				// grammar, and each entry of the matrix (called action) represents a
				// state transition in the graph.
				//
				// Typically, based on the category of the grammar symbol, the LRTable is
				// broken into two logically separate tables:
				// - ACTION table with terminals as columns -- e.g. ACTION[S, a] specifies
				// next action (shift/reduce/accept/error) on state S under a lookahead
				// terminal a
				// - GOTO table with nonterminals as columns -- e.g. GOTO[S, X] specifies
				// the state which we transition to from the state S with the nonterminal X
				//
				// LRTable is performance-critical as it is consulted frequently during a
				// parse. In general, LRTable is very sparse (most of the entries are empty).
				// For example, for the C++ language, the SLR table has ~1500 states and 650
sammccall (Done): space -> sparse
				// symbols which results in a matrix having 975K entries, ~90% of entries are
				// empty.
				//
				// This file implements a speed-and-space-efficient LRTable.
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_CLANG_TOOLING_SYNTAX_PSEUDO_LRTABLE_H
				#define LLVM_CLANG_TOOLING_SYNTAX_PSEUDO_LRTABLE_H

				#include "clang/Tooling/Syntax/Pseudo/Grammar.h"
sammccall (Done): API polish: this class is a mix of accessors and direct member state, so it can be hard to know how best to interact with such a thing. I'd suggest:
- public factories `static Action Action::reduce(RuleID)` etc, private data & constructor
- expose `Kind kind()`, but not the individual is() methods: they don't buy much, and this pattern encourages switch where appropriate
- make `Kind` an enum rather than an `enum class`: `Action::Shift` is sufficiently scoped already
- shift() & goTo() -> `targetState()`, which asserts either condition. Apart from anything else, this makes naming easier: "shift" is a verb, so `shift()` is a confusing name.
sammccall (Done): Mention that this combines the Action and Go-to tables from LR literature?
				#include "llvm/ADT/ArrayRef.h"
				#include <cstdint>
sammccall (Done): I think it's worth documenting what each of these actions means. Referring to enum values as "action table" and "goto table" seems confusing: they're part of the same table! The main reference point when reading the code needs to be the code, rather than the textbook.
				#include <vector>

				namespace clang {
				namespace syntax {
				namespace pseudo {

				// Represents the LR parsing table, which can efficiently answer the question "what is
				// the next step given the lookahead token and current state on top of the
				// stack?".
				//
				// This is a dense implementation, which only takes an amount of space that is
				// proportional to the number of non-empty entries in the table.
				//
				// Unlike the typical LR parsing table which allows at most one available action
				// per entry, conflicted actions are allowed in LRTable. The LRTable is designed
				// to be used in nondeterministic LR parsers (e.g. GLR).
				class LRTable {
sammccall (Done): Maybe mention the relation to GLR: GLR can execute these in parallel?
				public:
				// StateID is only 13 bits wide.
				using StateID = uint16_t;
				static constexpr unsigned StateBits = 13;

				// Action represents the terminal and nonterminal actions, it combines the
				// entry of the ACTION and GOTO tables from the LR literature.
				class Action {
				public:
sammccall (Done): If you make the data members private, the union could just be a uint16 and save yourself some trouble here :-) Up to you.
sammccall (Done): nit: I think you want this to be a comment on the section rather than on Error; leave a blank line?
				enum Kind : uint8_t {
sammccall (Not Done): Error seems like an action we'd dynamically handle, rather than an unreachable sentinel. I'd prefer to call this Sentinel and llvm_unreachable() on it in the relevant places. Even if we do plan dynamic error actions, we have enough spare bits to keep these two cases separate.
hokein (Author, Done): Changed to Sentinel, and removed the Error action for now.
				Sentinel = 0,
				// Terminal actions, corresponding to entries of ACTION table.

				// Shift to state n: move forward with the lookahead, and push state n
				// onto the state stack.
				// A shift is a forward transition, and the value n is the next state that
				// the parser is to enter.
				Shift,
				// Reduce by a rule: pop the state stack.
				Reduce,
sammccall (Done): // Pops
				// Signals that we have parsed the input successfully.
				Accept,
sammccall (Not Done): Again, blank line between "nonterminal actions" and "go to state n".
sammccall (Not Done): pust -> push? I thought goto replaces the top of the stack rather than being pushed onto it. e.g.
1. stack is { stmt := . expr ; }, lookahead is IDENT, action is shift "expr := IDENT ."
2. stack is { stmt := . expr ; | expr := IDENT . }, lookahead is semi, action is reduce
3. stack is { stmt := . expr ; }, reduced symbol is expr, goto is state "stmt := expr . ;"
4. stack is { stmt := expr . ;}, lookahead is semi...
Line 3=>4 doesn't seem like a push.
hokein (Author, Done): Yeah, this is mostly right, except that the stack in step 2 has two frames rather than a combined one. The truth is:
1. stack is [{ _ := . stmt | stmt := . expr ; }], lookahead is IDENT, action is shift "expr := IDENT ." (push a new state to the stack)
2. stack is [{ _ := . stmt | stmt := . expr ; }, { expr := IDENT . }], lookahead is semi, action is reduce
3. stack is [{ _ := . stmt | stmt := . expr ; }], reduced symbol is expr, goto is state "stmt := expr . ;"
4. stack is [{ _ := . stmt | stmt := . expr ;}, { stmt := expr . ;}], lookahead is semi, action is shift "stmt := expr ; ."
5. stack is [{ _ := . stmt | stmt := . expr ;}, { stmt := expr . ;}, { stmt := expr ; .}], lookahead is eof, action is reduce "stmt := expr ;"
6. stack is [{ _ := . stmt | stmt := . expr ;}], reduced symbol is stmt, goto state is "_ := stmt ."
7. stack is [{ _ := . stmt | stmt := . expr ;}, { _ := stmt .}], lookahead is eof, action is accept, the parsing is done
Step 3 => 4 is a push.
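The seven-step trace in this reply can be replayed with a small hand-written table. The following is a minimal sketch, not code from the patch: the `Symbol`, `Act`, `Rule`, and `parse` names and the state numbers 0 to 4 are made up for this illustration, and the driver is deterministic (it does not handle the conflicting actions that a GLR parser would).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Toy grammar from the discussion:
//   _ := stmt      stmt := expr ;      expr := IDENT
// State 0 is the start state, 1 = {expr := IDENT .}, 2 = {stmt := expr . ;},
// 3 = {stmt := expr ; .}, 4 = {_ := stmt .}. All names are hypothetical.
enum Symbol { IDENT, SEMI, END, STMT, EXPR };
enum Kind { Shift, Reduce, Accept };
struct Act { Kind K; int Arg; };  // Arg: target state or rule index
struct Rule { Symbol Target; unsigned Length; };

const Rule Rules[] = {{EXPR, 1},   // 0: expr := IDENT
                      {STMT, 2}};  // 1: stmt := expr ;

// Runs the parse and returns a snapshot of the state stack after each step.
std::vector<std::vector<int>> parse(const std::vector<Symbol> &Input) {
  const std::map<std::pair<int, Symbol>, Act> Action = {
      {{0, IDENT}, {Shift, 1}}, {{1, SEMI}, {Reduce, 0}},
      {{2, SEMI}, {Shift, 3}},  {{3, END}, {Reduce, 1}},
      {{4, END}, {Accept, 0}}};
  const std::map<std::pair<int, Symbol>, int> GoTo = {{{0, EXPR}, 2},
                                                      {{0, STMT}, 4}};
  std::vector<int> Stack = {0};
  std::vector<std::vector<int>> Trace = {Stack};
  std::size_t Pos = 0;
  while (true) {
    Act A = Action.at({Stack.back(), Input[Pos]});
    if (A.K == Accept)
      return Trace;
    if (A.K == Shift) {
      Stack.push_back(A.Arg);  // consume the lookahead, push the new state
      ++Pos;
    } else {
      const Rule &R = Rules[A.Arg];
      Stack.resize(Stack.size() - R.Length);  // pop one state per RHS symbol
      Trace.push_back(Stack);
      // GoTo is a push: the state exposed by the pops stays underneath.
      Stack.push_back(GoTo.at({Stack.back(), R.Target}));
    }
    Trace.push_back(Stack);
  }
}
```

For the input `IDENT ; eof`, the recorded stacks shrink to the start state after each reduce and then grow by one on the following goto, which is the "step 3 => 4 is a push" behaviour described above.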
sammccall (Done): parsd -> parsed

				// Nonterminal actions, corresponding to entry of GOTO table.

				// Go to state n: push state n onto the state stack.
sammccall (Done): This action struct can be squeezed to 16 bits. Kind only needs 3 bits (I suspect it can be 2: Error and Accept aren't heavily used). RuleID is 12 bits; StateID would fit within 11, and 12 should be safe. Combined with the optimization suggested below, this should get the whole table down to 335KB.
hokein (Author, Done): Squeezed to 16 bits (3 bits for Kind, and 13 bits for the Value).
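The 3 + 13 bit split mentioned here can be illustrated with explicit shifts and masks. This is a sketch only: `PackedAction` is a hypothetical class, and it packs by hand rather than with the bit-fields the patch uses, so that the layout is visible.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for LRTable::Action: 3 bits of kind in the high
// bits, 13 bits of value (a state or rule id) in the low bits.
class PackedAction {
public:
  enum Kind : uint8_t { Sentinel = 0, Shift, Reduce, Accept, GoTo };
  static constexpr unsigned KindBits = 3;
  static constexpr unsigned ValueBits = 13;
  static_assert(KindBits + ValueBits <= 16, "must fit in uint16_t");

  PackedAction(Kind K, unsigned Value)
      : Bits(static_cast<uint16_t>((K << ValueBits) | Value)) {
    // A larger value would spill into the kind bits.
    assert(Value < (1u << ValueBits) && "value does not fit in 13 bits");
  }
  Kind kind() const { return static_cast<Kind>(Bits >> ValueBits); }
  unsigned value() const { return Bits & ((1u << ValueBits) - 1); }
  uint16_t opaque() const { return Bits; }

private:
  uint16_t Bits;
};
static_assert(sizeof(PackedAction) == 2, "one action per 16-bit word");
```

With 1449 states in the C++ grammar, the largest state id (1448) fits comfortably in 13 bits, and five kinds fit in 3 bits.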
				// Similar to Shift, but it is a nonterminal forward transition.
				GoTo,
				};

				static Action accept(RuleID RID) { return Action(Accept, RID); }
				static Action goTo(StateID S) { return Action(GoTo, S); }
				static Action shift(StateID S) { return Action(Shift, S); }
				static Action reduce(RuleID RID) { return Action(Reduce, RID); }
				static Action sentinel() { return Action(Sentinel, 0); }

				StateID getShiftState() const {
				assert(kind() == Shift);
				return Value;
				}
				StateID getGoToState() const {
				assert(kind() == GoTo);
				return Value;
				}
sammccall (Done): nit: please be consistent with `Nonterminal` (and "nonterminal") vs `NonTerminal` (and "non-terminal"). I think `Nonterminal` is used elsewhere in the code.
				RuleID getReduceRule() const {
				assert(kind() == Reduce);
				return Value;
sammccall (Done): Why is this assertion valid? Is it a precondition of the function that some item in From can advance over NonTerminal?
hokein (Author, Done): This is guaranteed by a DFA (a node cannot have two out edges with the same label), the same reason why we don't have shift/shift conflicts.
				}
sammccall (Done): Maybe value()=>opaque() or asInteger()? Partly to give a bit more of a hint at the purpose, partly to avoid confusion of `Value != value()`.
hokein (Author, Done): Renamed to `opaque`.
				Kind kind() const { return static_cast<Kind>(K); }

				bool operator==(const Action &L) const { return opaque() == L.opaque(); }
				uint16_t opaque() const { return K << ValueBits | Value; };

				private:
				Action(Kind K1, unsigned Value) : K(K1), Value(Value) {}
				static constexpr unsigned ValueBits = StateBits;
sammccall (Not Done): static assert RuleBits <= ValueBits, and I think we want a StateBits above plus some assertions on the number of states when building.
sammccall (Done): FWIW I've mostly seen just `unsigned` as the field type for bitfields. It seems a little confusing to explicitly specify the size both as 16 and 13 bits. Up to you, but if you change this one also consider the Kind enum.
				static constexpr unsigned KindBits = 3;
				static_assert(ValueBits >= RuleBits, "Value must be able to store RuleID");
				static_assert(KindBits + ValueBits <= 16,
				"Must be able to store kind and value efficiently");
				uint16_t K : KindBits;
				// Either StateID or RuleID, depending on the Kind.
				uint16_t Value : ValueBits;
sammccall (Not Done): I think there's an even more compact representation: basically sort the index keys as symbol-major, but then don't store the actual symbol there; rather, store the range for the symbol in an external array:
    // index is lookahead tok::TokenKind. Values are indices into Index: the range for tok is StateIdx[TokenIdx[tok]...TokenIdx[tok+1]]
    vector<unsigned> TokenIdx;
    // index is nonterminal SymbolID. Values are indices into Index: the range for sym is StateIdx[NontermIdx[sym]...NontermIdx[sym+1]].
    vector<unsigned> NontermIdx;
    // parallel to actions, value is the from state. Only subranges are sorted.
    vector<StateID> StateIdx;
    vector<Action> Actions;
This should be 4 * (392 + 245) + (2 + 4) * 83k ~= 500k. And the one extra indirection should save 5-10 steps of binary search.
This is equivalent to 2 * `vector<unsigned>` + `vector<pair<StateID, Action>>`, but keeping the searched index compact is probably a good thing.
hokein (Author, Done): This looks like a nice improvement, though the implementation is a little tricky -- we reduced the binary-search space from 2 dimension (State, Symbol) to 1 dimension Symbol.
				};

				// Returns all available actions for the given state on a terminal.
				// Expected to be called by LR parsers.
				llvm::ArrayRef<Action> getActions(StateID State, SymbolID Terminal) const;
				// Returns the state after we reduce a nonterminal.
				// Expected to be called by LR parsers.
				StateID getGoToState(StateID State, SymbolID Nonterminal) const;

				// Looks up available actions.
				// Returns empty if no available actions in the table.
				llvm::ArrayRef<Action> find(StateID State, SymbolID Symbol) const;

				size_t bytes() const {
				return sizeof(*this) + Actions.capacity() * sizeof(Action) +
sammccall (Not Done): This is quite a lot of surface (together with the DenseMapInfo stuff) to expose. Currently it looks like it needs to be in the header so that the LRTableTest can construct a table directly rather than going through LRGraph. I think you could expose a narrower interface:
    struct Entry { ;... };
    // Specify exact table contents for testing.
    static LRTable buildForTests(ArrayRef<Entry>);
Then the builder/densemapinfo can be moved to the cpp file, and both `buildForTests` and `buildSLR` can use the builder internally. WDYT?
hokein (Author, Done): Yeah, I exposed the Builder mainly for testing purposes. The new interface looks good to me.
				States.capacity() * sizeof(StateID) +
				NontermOffset.capacity() * sizeof(uint32_t) +
				TerminalOffset.capacity() * sizeof(uint32_t);
				}

				std::string dumpStatistics() const;
				std::string dumpForTests(const Grammar &G) const;

				// Build a SLR(1) parsing table.
				static LRTable buildSLR(const Grammar &G);

				class Builder;
				// Represents an entry in the table, used for building the LRTable.
				struct Entry {
				StateID State;
				SymbolID Symbol;
				Action Act;
				};
sammccall (Done): Maybe "value is the offset into States/Actions where the entries for this symbol begin".
				// Build a specified table for testing purposes.
				static LRTable buildForTests(const GrammarTable &, llvm::ArrayRef<Entry>);

				private:
				// Conceptually the LR table is a multimap from (State, SymbolID) => Action.
				// Our physical representation is quite different for compactness.

				// Index is nonterminal SymbolID, value is the offset into States/Actions
				// where the entries for this nonterminal begin.
sammccall (Done): Reading the building/lookup code, "index" and "idx" are used so often they're hard to follow. I think calling this "States", parallel to "Actions", would be clearer. Similarly, maybe NontermIdx -> NontermOffset (offset into the States/Actions arrays where we find entries for this nonterminal). Then we can just use the word "index" for actual subscripts.
				// Given a nonterminal id, the corresponding half-open range of States is
				// [NontermOffset[id], NontermOffset[id+1]).
sammccall (Done): concepetually -> conceptually. I think mixing this comment in with the concrete data is confusing. Maybe lift it to the top of the private section like:
    // Conceptually this is a multimap from (State, SymbolID) => action.
    // In the literature this is often a table (and multiple entries, i.e. conflicts are forbidden).
    // Our physical representation is quite different for compactness.
				std::vector<uint32_t> NontermOffset;
				// Similar to NontermOffset, but for terminals, index is tok::TokenKind.
				std::vector<uint32_t> TerminalOffset;
				// Parallel to Actions, the value is State (rows of the matrix).
				// Grouped by the SymbolID, and only subranges are sorted.
				std::vector<StateID> States;
				// A flat list of available actions, sorted by (SymbolID, State).
				std::vector<Action> Actions;
				};
				llvm::raw_ostream &operator<<(llvm::raw_ostream &, const LRTable::Action &);

				} // namespace pseudo
				} // namespace syntax
				} // namespace clang

				#endif // LLVM_CLANG_TOOLING_SYNTAX_PSEUDO_LRTABLE_H
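The symbol-major layout behind `NontermOffset`/`TerminalOffset`, `States`, and `Actions` can be sketched as follows. This is a simplified illustration and not the patch's implementation: `CompactTable`, its single `Offsets` vector (instead of separate terminal and nonterminal offsets), the plain `int` actions, and the example data are all assumptions made for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Simplified stand-in for the private layout of LRTable.
struct CompactTable {
  // Offsets[Sym] .. Offsets[Sym + 1] is the slice of States/Actions that
  // belongs to Sym, so Offsets has one more element than there are symbols.
  std::vector<uint32_t> Offsets;
  // Parallel arrays; within each symbol's slice, States is sorted.
  std::vector<uint16_t> States;
  std::vector<int> Actions;

  // Returns [begin, end) pointers to the actions for (State, Sym). The range
  // is empty when there is no entry and longer than one on a conflict.
  std::pair<const int *, const int *> find(uint16_t State, unsigned Sym) const {
    auto First = States.begin() + Offsets[Sym];
    auto Last = States.begin() + Offsets[Sym + 1];
    // Binary search only within this symbol's slice.
    auto Range = std::equal_range(First, Last, State);
    const int *Base = Actions.data();
    return {Base + (Range.first - States.begin()),
            Base + (Range.second - States.begin())};
  }
};

// Example data (hypothetical): symbol 0 has entries for states 1, 4, 4;
// symbol 1 has none; symbol 2 has one entry for state 0.
CompactTable makeExample() {
  CompactTable T;
  T.Offsets = {0, 3, 3, 4};
  T.States = {1, 4, 4, 0};
  T.Actions = {10, 20, 21, 30};
  return T;
}
```

The symbol lookup is a direct array index, so the binary search runs over a single symbol's states rather than over all (state, symbol) pairs.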

clang/lib/Tooling/Syntax/Pseudo/CMakeLists.txt

	set(LLVM_LINK_COMPONENTS Support)			set(LLVM_LINK_COMPONENTS Support)

	add_clang_library(clangToolingSyntaxPseudo			add_clang_library(clangToolingSyntaxPseudo
	Grammar.cpp			Grammar.cpp
	GrammarBNF.cpp			GrammarBNF.cpp
	LRGraph.cpp			LRGraph.cpp
				LRTable.cpp
				LRTableBuild.cpp

	LINK_LIBS			LINK_LIBS
	clangBasic			clangBasic
	clangLex			clangLex
	)			)

clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp

Show All 40 Lines	llvm::ArrayRef<Rule> Grammar::rulesFor(SymbolID SID) const {
return llvm::makeArrayRef(&T->Rules[R.start], R.end - R.start);		return llvm::makeArrayRef(&T->Rules[R.start], R.end - R.start);
}		}

const Rule &Grammar::lookupRule(RuleID RID) const {		const Rule &Grammar::lookupRule(RuleID RID) const {
assert(RID < T->Rules.size());		assert(RID < T->Rules.size());
return T->Rules[RID];		return T->Rules[RID];
}		}

llvm::StringRef Grammar::symbolName(SymbolID SID) const {		llvm::StringRef Grammar::symbolName(SymbolID SID) const {
sammccall (Done): This function could use some comments.
    A rule S := T ... implies elements in first(S):
     - if T is a terminal, first(S) contains T
     - if T is a nonterminal, first(S) contains first(T)
    Since first(T) may not have been fully computed yet, first(S) itself may end up being incomplete.
    We iterate until a fixed point.
    (This isn't particularly efficient, but table building performance isn't critical).
if (isToken(SID))		if (isToken(SID))
return T->Terminals[symbolToToken(SID)];		return T->Terminals[symbolToToken(SID)];
return T->Nonterminals[SID].Name;		return T->Nonterminals[SID].Name;
}		}

std::string Grammar::dumpRule(RuleID RID) const {		std::string Grammar::dumpRule(RuleID RID) const {
std::string Result;		std::string Result;
llvm::raw_string_ostream OS(Result);		llvm::raw_string_ostream OS(Result);
const Rule &R = T->Rules[RID];		const Rule &R = T->Rules[RID];
OS << symbolName(R.Target) << " :=";		OS << symbolName(R.Target) << " :=";
for (SymbolID SID : R.seq())		for (SymbolID SID : R.seq())
OS << " " << symbolName(SID);		OS << " " << symbolName(SID);
sammccall (Done): `bool Changed = true; while(Changed) {`? or maybe `for (bool Changed = true; Changed;) {`? (do-while is rare enough that it always takes me a bit to grasp the control flow)
return Result;		return Result;
}		}

std::string Grammar::dumpRules(SymbolID SID) const {		std::string Grammar::dumpRules(SymbolID SID) const {
assert(isNonterminal(SID));		assert(isNonterminal(SID));
sammccall (Done): Comment that we only need to consider the first element because symbols are non-nullable.
std::string Result;		std::string Result;
const auto &Range = T->Nonterminals[SID].RuleRange;		const auto &Range = T->Nonterminals[SID].RuleRange;
for (RuleID RID = Range.start; RID < Range.end; ++RID)		for (RuleID RID = Range.start; RID < Range.end; ++RID)
Result.append(dumpRule(RID)).push_back('\n');		Result.append(dumpRule(RID)).push_back('\n');
return Result;		return Result;
}		}

std::string Grammar::dump() const {		std::string Grammar::dump() const {
Show All 15 Lines	auto ExpandFirstSet = [&FirstSets](SymbolID Target, SymbolID First) {
assert(isNonterminal(Target));		assert(isNonterminal(Target));
if (isToken(First))		if (isToken(First))
return FirstSets[Target].insert(First).second;		return FirstSets[Target].insert(First).second;
bool Changed = false;		bool Changed = false;
for (SymbolID SID : FirstSets[First])		for (SymbolID SID : FirstSets[First])
Changed |= FirstSets[Target].insert(SID).second;		Changed |= FirstSets[Target].insert(SID).second;
return Changed;		return Changed;
};		};

sammccall (Not Done): nit: `... Y Z`, with consistent spaces/capitalization. Unless case is meant to imply something here?
hokein (Author, Done): Y was for a nonterminal while z was for nonterminals or terminals. They are not super important; changed all to uppercase.
// A rule S := T ... implies elements in FIRST(S):		// A rule S := T ... implies elements in FIRST(S):
// - if T is a terminal, FIRST(S) contains T		// - if T is a terminal, FIRST(S) contains T
// - if T is a nonterminal, FIRST(S) contains FIRST(T)		// - if T is a nonterminal, FIRST(S) contains FIRST(T)
// Since FIRST(T) may not have been fully computed yet, FIRST(S) itself may		// Since FIRST(T) may not have been fully computed yet, FIRST(S) itself may
// end up being incomplete.		// end up being incomplete.
// We iterate until we hit a fixed point.		// We iterate until we hit a fixed point.
// (This isn't particularly efficient, but table building isn't on the		// (This isn't particularly efficient, but table building isn't on the
sammccall (Done): Copying the set here seems gratuitous, especially since we're doing this in a loop where !changed is the usual thing. I think inlining ExpandFollowSet for each case would make sense here. It turns into 1 line for the NT case plus two lines for the T case, so it's actually shorter overall.
// critical path).		// critical path).
bool Changed = true;		bool Changed = true;
while (Changed) {		while (Changed) {
Changed = false;		Changed = false;
for (const auto &R : G.table().Rules)		for (const auto &R : G.table().Rules)
// We only need to consider the first element because symbols are		// We only need to consider the first element because symbols are
sammccall (Done): I think `if (isNonterminal(Z)) ...expand...` would be clearer here, since the condition is really part of rule 3 (but in a not-totally obvious way, it's nice to see them side-by-side).
// non-nullable.		// non-nullable.
Changed |= ExpandFirstSet(R.Target, R.seq().front());		Changed |= ExpandFirstSet(R.Target, R.seq().front());
}		}
return FirstSets;		return FirstSets;
}		}
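The fixed-point iteration in firstSets can be tried on a toy grammar without the Grammar class. The sketch below is an assumption-laden illustration, not the patch's code: `Sym`, `ToyRule`, `toyFirstSets`, and the integer symbol encoding (nonterminals numbered below `NumNonterminals`) are invented for this example, and `std::set` replaces `llvm::DenseSet`.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Toy encoding: symbols below NumNonterminals are nonterminals, the rest
// are terminals. This is not the patch's SymbolID scheme.
constexpr int NumNonterminals = 2;
enum Sym { Stmt = 0, Expr = 1, Ident = 2, LParen = 3, RParen = 4, Semi = 5 };
struct ToyRule {
  int Target;
  std::vector<int> Seq;
};

std::vector<std::set<int>> toyFirstSets(const std::vector<ToyRule> &Rules) {
  std::vector<std::set<int>> First(NumNonterminals);
  bool Changed = true;
  while (Changed) {  // repeat until no set grows any more
    Changed = false;
    for (const ToyRule &R : Rules) {
      // Only the first element matters because no symbol is nullable.
      int Head = R.Seq.front();
      if (Head >= NumNonterminals) {
        // Terminal: it begins every string derived via this rule.
        Changed |= First[R.Target].insert(Head).second;
      } else if (Head != R.Target) {
        // Nonterminal: inherit whatever FIRST(Head) holds so far.
        for (int S : First[Head])
          Changed |= First[R.Target].insert(S).second;
      }
    }
  }
  return First;
}
```

With the rules `stmt := expr ;`, `expr := IDENT`, and `expr := ( expr )` listed in that order, the first pass leaves FIRST(stmt) empty because FIRST(expr) has not been filled yet; the second pass picks it up, which is why the loop runs until nothing changes.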

std::vector<llvm::DenseSet<SymbolID>> followSets(const Grammar &G) {		std::vector<llvm::DenseSet<SymbolID>> followSets(const Grammar &G) {
auto FirstSets = firstSets(G);		auto FirstSets = firstSets(G);
Show All 38 Lines	for (const auto &R : G.table().Rules) {
SymbolID Z = R.seq().back();		SymbolID Z = R.seq().back();
if (isNonterminal(Z))		if (isNonterminal(Z))
Changed |= ExpandFollowSet(Z, FollowSets[R.Target]);		Changed |= ExpandFollowSet(Z, FollowSets[R.Target]);
}		}
}		}
return FollowSets;		return FollowSets;
}		}

		static llvm::ArrayRef<std::string> getTerminalNames() {
		static const std::vector<std::string> *TerminalNames = []() {
		static std::vector<std::string> TerminalNames;
		TerminalNames.reserve(NumTerminals);
		for (unsigned I = 0; I < NumTerminals; ++I) {
		tok::TokenKind K = static_cast<tok::TokenKind>(I);
		if (const auto *Punc = tok::getPunctuatorSpelling(K))
		TerminalNames.push_back(Punc);
		else
		TerminalNames.push_back(llvm::StringRef(tok::getTokenName(K)).upper());
		}
		return &TerminalNames;
		}();
		return *TerminalNames;
		}
		GrammarTable::GrammarTable() : Terminals(getTerminalNames()) {}

} // namespace pseudo		} // namespace pseudo
} // namespace syntax		} // namespace syntax
} // namespace clang		} // namespace clang

clang/lib/Tooling/Syntax/Pseudo/GrammarBNF.cpp

	[... 15 lines not shown ...]
	namespace clang {			namespace clang {
	namespace syntax {			namespace syntax {
	namespace pseudo {			namespace pseudo {

	namespace {			namespace {
	static const llvm::StringRef OptSuffix = "_opt";			static const llvm::StringRef OptSuffix = "_opt";
	static const llvm::StringRef StartSymbol = "_";			static const llvm::StringRef StartSymbol = "_";

	void initTerminals(std::vector<std::string> &Out) {
	Out.clear();
	Out.reserve(NumTerminals);
	for (unsigned I = 0; I < NumTerminals; ++I) {
	tok::TokenKind K = static_cast<tok::TokenKind>(I);
	if (const auto *Punc = tok::getPunctuatorSpelling(K))
	Out.push_back(Punc);
	else
	Out.push_back(llvm::StringRef(tok::getTokenName(K)).upper());
	}
	}
	// Builds grammar from BNF files.			// Builds grammar from BNF files.
	class GrammarBuilder {			class GrammarBuilder {
	public:			public:
	GrammarBuilder(std::vector<std::string> &Diagnostics)			GrammarBuilder(std::vector<std::string> &Diagnostics)
	: Diagnostics(Diagnostics) {}			: Diagnostics(Diagnostics) {}

	std::unique_ptr<Grammar> build(llvm::StringRef BNF) {			std::unique_ptr<Grammar> build(llvm::StringRef BNF) {
	auto Specs = eliminateOptional(parse(BNF));			auto Specs = eliminateOptional(parse(BNF));

	assert(llvm::all_of(Specs,			assert(llvm::all_of(Specs,
	[](const RuleSpec &R) {			[](const RuleSpec &R) {
	if (R.Target.endswith(OptSuffix))			if (R.Target.endswith(OptSuffix))
	return false;			return false;
	return llvm::all_of(			return llvm::all_of(
	R.Sequence, [](const RuleSpec::Element &E) {			R.Sequence, [](const RuleSpec::Element &E) {
	return !E.Symbol.endswith(OptSuffix);			return !E.Symbol.endswith(OptSuffix);
	});			});
	}) &&			}) &&
	"Optional symbols should be eliminated!");			"Optional symbols should be eliminated!");

	auto T = std::make_unique<GrammarTable>();			auto T = std::make_unique<GrammarTable>();
	initTerminals(T->Terminals);

	// Assemble the name->ID and ID->nonterminal name maps.			// Assemble the name->ID and ID->nonterminal name maps.
	llvm::DenseSet<llvm::StringRef> UniqueNonterminals;			llvm::DenseSet<llvm::StringRef> UniqueNonterminals;
	llvm::DenseMap<llvm::StringRef, SymbolID> SymbolIds;			llvm::DenseMap<llvm::StringRef, SymbolID> SymbolIds;
	for (uint16_t I = 0; I < NumTerminals; ++I)			for (uint16_t I = 0; I < NumTerminals; ++I)
	SymbolIds.try_emplace(T->Terminals[I], tokenSymbol(tok::TokenKind(I)));			SymbolIds.try_emplace(T->Terminals[I], tokenSymbol(tok::TokenKind(I)));
	auto Consider = [&](llvm::StringRef Name) {			auto Consider = [&](llvm::StringRef Name) {
	if (!SymbolIds.count(Name))			if (!SymbolIds.count(Name))
	[... 196 lines not shown ...]

clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp

This file was added.

				//===--- LRTable.cpp - Parsing table for LR parsers --------------- C++--===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "clang/Tooling/Syntax/Pseudo/LRTable.h"
				#include "clang/Tooling/Syntax/Pseudo/Grammar.h"
				#include "llvm/ADT/ArrayRef.h"
				#include "llvm/ADT/STLExtras.h"
				#include "llvm/Support/ErrorHandling.h"
				#include "llvm/Support/FormatVariadic.h"
				#include "llvm/Support/raw_ostream.h"

				namespace clang {
				namespace syntax {
				namespace pseudo {

				llvm::raw_ostream &operator<<(llvm::raw_ostream &OS, const LRTable::Action &A) {
				switch (A.kind()) {
				case LRTable::Action::Shift:
				return OS << llvm::formatv("shift state {0}", A.getShiftState());
				case LRTable::Action::Reduce:
				return OS << llvm::formatv("reduce by rule {0}", A.getReduceRule());
				sammccall [Done]: as things currently stand this should be unreachable - these values should never escape the DenseMap. assert or llvm_unreachable?
				case LRTable::Action::GoTo:
				return OS << llvm::formatv("go to state {0}", A.getGoToState());
				case LRTable::Action::Accept:
				sammccall [Done]: nit: goTo -> go to, like dumpForTests?
				return OS << "acc";
				case LRTable::Action::Sentinel:
				llvm_unreachable("unexpected Sentinel action kind!");
				}
				}

				std::string LRTable::dumpStatistics() const {
				StateID NumOfStates = 0;
				for (StateID It : States)
				NumOfStates = std::max(It, NumOfStates);
				return llvm::formatv(R"(
				Statistics of the LR parsing table:
				number of states: {0}
				number of actions: {1}
				size of the table (bytes): {2}
				)",
				NumOfStates, Actions.size(), bytes())
				.str();
				}

				std::string LRTable::dumpForTests(const Grammar &G) const {
				std::string Result;
				llvm::raw_string_ostream OS(Result);
				StateID MaxState = 0;
				for (StateID It : States)
				MaxState = std::max(MaxState, It);
				OS << "LRTable:\n";
				for (StateID S = 0; S <= MaxState; ++S) {
				OS << llvm::formatv("State {0}\n", S);
				for (uint16_t Terminal = 0; Terminal < NumTerminals; ++Terminal) {
				SymbolID TokID = tokenSymbol(static_cast<tok::TokenKind>(Terminal));
				for (auto A : find(S, TokID)) {
				if (A.kind() == LRTable::Action::Shift)
				OS.indent(4) << llvm::formatv("'{0}': shift state {1}\n",
				G.symbolName(TokID), A.getShiftState());
				else if (A.kind() == LRTable::Action::Reduce)
				OS.indent(4) << llvm::formatv("'{0}': reduce by rule {1} '{2}'\n",
				G.symbolName(TokID), A.getReduceRule(),
				G.dumpRule(A.getReduceRule()));
				else if (A.kind() == LRTable::Action::Accept)
				OS.indent(4) << llvm::formatv("'{0}': accept\n", G.symbolName(TokID));
				}
				}
				for (SymbolID NontermID = 0; NontermID < G.table().Nonterminals.size();
				++NontermID) {
				if (find(S, NontermID).empty())
				continue;
				OS.indent(4) << llvm::formatv("'{0}': go to state {1}\n",
				G.symbolName(NontermID),
				getGoToState(S, NontermID));
				}
				}
				return OS.str();
				}

				llvm::ArrayRef<LRTable::Action> LRTable::getActions(StateID State,
				SymbolID Terminal) const {
				assert(pseudo::isToken(Terminal) && "expect terminal symbol!");
				return find(State, Terminal);
				}

				LRTable::StateID LRTable::getGoToState(StateID State,
				SymbolID Nonterminal) const {
				assert(pseudo::isNonterminal(Nonterminal) && "expected nonterminal symbol!");
				auto Result = find(State, Nonterminal);
				assert(Result.size() == 1 && Result.front().kind() == Action::GoTo);
				return Result.front().getGoToState();
				}

				sammccall [Done]: (moving the comment thread to the new code location) The DFA input guarantees Result.size() <= 1, but why can't it be zero? If this is a requirement on the caller, mention it in the header?
				hokein (author) [Done]: yeah, getGoToState and getActions are expected to be called by the GLR parser, the parser itself should guarantee it during parsing. Added comments.
				llvm::ArrayRef<LRTable::Action> LRTable::find(StateID Src, SymbolID ID) const {
				size_t Idx = isToken(ID) ? symbolToToken(ID) : ID;
				assert(isToken(ID) ? Idx + 1 < TerminalOffset.size()
				: Idx + 1 < NontermOffset.size());
				std::pair<size_t, size_t> TargetStateRange =
				isToken(ID) ? std::make_pair(TerminalOffset[Idx], TerminalOffset[Idx + 1])
				: std::make_pair(NontermOffset[Idx], NontermOffset[Idx + 1]);
				auto TargetedStates =
				llvm::makeArrayRef(States.data() + TargetStateRange.first,
				States.data() + TargetStateRange.second);

				assert(llvm::is_sorted(TargetedStates) &&
				"subrange of the StateIdx should be sorted!");
				const LRTable::StateID *It = llvm::partition_point(
				TargetedStates, [&Src](LRTable::StateID S) { return S < Src; });
				if (It == TargetedStates.end())
				return {};
				size_t Start = It - States.data(), End = Start;
				while (End < States.size() && States[End] == Src)
				++End;
				return llvm::makeArrayRef(&Actions[Start], &Actions[End]);
				aaron.ballman [Not Done]: This is causing an assertion with debug builds on Windows because `Actions[End]` is out of bounds (so the MSVC STL debug assertions catch the issue) for the test cases in this patch.
				hokein (author) [Done]: this should be `llvm::makeArrayRef(&Actions[Start], End - Start)`. Fixed in 302ca279cb83043ef7d60115eb5ba58f12064a4a.
				aaron.ballman [Not Done]: I can confirm that the issue is now fixed for me, thank you!
				}

				} // namespace pseudo
				} // namespace syntax
				} // namespace clang
				sammccall [Done]: nit: "assign" seems clearer than "resize"
				sammccall [Done]: We end up with holes because we're looping over the values rather than the keys. This also feels quite indirect. Seems clear enough to reverse this, something like: `unsigned Pos = 0; for (unsigned NT = 0; NT < GT.Nonterminals.size(); ++NT) { NontermIdx[NT] = Pos; while (Pos < Sorted.size() && Sorted[Pos].Symbol == NT) ++Pos; } NontermIdx.back() = Pos;` // and the same for terminals
				hokein (author) [Done]: yeah, this is much better!
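
The lookup performed by `LRTable::find` over the flat vectors can be illustrated with standard containers. The sketch below is a simplified stand-in, not the patch's code: `FlatTable`, `findActions` and `exampleTable` are hypothetical names, a single offset vector replaces the separate terminal and nonterminal offsets, and strings stand in for `Action`. It uses `std::equal_range` inside the symbol's own sub-range, so the result is always described by a begin index and an end index and no element at position `End` is ever dereferenced, which is the kind of access the MSVC debug STL flagged in the thread above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical flat table: entries are grouped by symbol, and sorted by state
// within each symbol's range [Offset[Symbol], Offset[Symbol + 1]).
struct FlatTable {
  std::vector<std::uint32_t> Offset;  // one slot per symbol, plus a sentinel
  std::vector<std::uint16_t> States;  // parallel to Actions
  std::vector<std::string> Actions;   // a string stands in for LRTable::Action
};

// Returns all actions stored for (State, Symbol); empty if there are none.
std::vector<std::string> findActions(const FlatTable &T, std::uint16_t State,
                                     std::size_t Symbol) {
  assert(Symbol + 1 < T.Offset.size() && "symbol out of range");
  auto First = T.States.begin() + T.Offset[Symbol];
  auto Last = T.States.begin() + T.Offset[Symbol + 1];
  // Binary search restricted to this symbol's range.
  auto Range = std::equal_range(First, Last, State);
  auto Begin = Range.first - T.States.begin();
  auto End = Range.second - T.States.begin();
  return std::vector<std::string>(T.Actions.begin() + Begin,
                                  T.Actions.begin() + End);
}

// The table from the Builder comment: symbols a = 0, b = 1, c = 2.
FlatTable exampleTable() {
  FlatTable T;
  T.Offset = {0, 1, 4, 4};
  T.States = {1, 0, 0, 2};
  T.Actions = {"acc", "s0", "r0", "r1"};
  return T;
}
```

With the example table, `findActions(T, 0, 1)` yields the two conflicting actions `s0` and `r0`, while `findActions(T, 0, 0)` is empty. Restricting the search to the symbol's range also means a run of equal states can never extend into the next symbol's entries.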

clang/lib/Tooling/Syntax/Pseudo/LRTableBuild.cpp

This file was added.

				//===--- LRTableBuild.cpp - Build a LRTable from LRGraph ---------- C++--===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "clang/Basic/TokenKinds.h"
				#include "clang/Tooling/Syntax/Pseudo/Grammar.h"
				#include "clang/Tooling/Syntax/Pseudo/LRGraph.h"
				#include "clang/Tooling/Syntax/Pseudo/LRTable.h"
				#include <cstdint>

				namespace llvm {
				template <> struct DenseMapInfo<clang::syntax::pseudo::LRTable::Entry> {
				using Entry = clang::syntax::pseudo::LRTable::Entry;
				static inline Entry getEmptyKey() {
				static Entry E{static_cast<clang::syntax::pseudo::SymbolID>(-1), 0,
				clang::syntax::pseudo::LRTable::Action::sentinel()};
				sammccall [Not Done]: (I think it would be worth moving this into LRTable in order to hide the Builder type from the header...)
				hokein (author) [Done]: I tend to move all build bits (Builder, BuildSLRTable, BuildForTests) to this file rather than `LRTable.cpp`.
				return E;
				}
				static inline Entry getTombstoneKey() {
				static Entry E{static_cast<clang::syntax::pseudo::SymbolID>(-2), 0,
				clang::syntax::pseudo::LRTable::Action::sentinel()};
				return E;
				}
				static unsigned getHashValue(const Entry &I) {
				return llvm::hash_combine(I.State, I.Symbol, I.Act.opaque());
				}
				static bool isEqual(const Entry &LHS, const Entry &RHS) {
				return LHS.State == RHS.State && LHS.Symbol == RHS.Symbol &&
				sammccall [Not Done]: (rephrasing a comment that got lost earlier in the review) I'm not totally clear on the scope/purpose of the accept action:
				- if it's meant to be sufficient to know whether the parse succeeded, I think it needs the symbol we accepted as a payload. The grammar will accept them all, but the caller will place restrictions.
				- if we're OK inspecting the stack and seeing our last reduction was a Decl or so, why doesn't the parser just do that at the end of the stream and do away with `Accept` altogether? (This would avoid the need to splice `eof` tokens into token streams to parse subranges of them.)
				hokein (author) [Done]: The accept action follows what is described in the standard LR literature. Yeah, the main purpose of the accept action is to tell us whether a parse succeeded.
				We could add a ruleID as a payload for the accept action, but I'm not sure whether it would be useful: the associated rule is about `_` (e.g. `_ := translation_unit`), whereas we are interested in `translation_unit`, which can be retrieved from the forest.
				I think both options are mostly equivalent (the 1st option is a standard LR implementation, the 2nd one seems more ad hoc): we can treat the accept action as a "special" reduce action of the rule `_ := translation_unit`, except that we don't do the reduce, we just stop parsing.
				I think we will probably revisit this later. Things become trickier if we start supporting snippets (the start symbol `_` will have multiple rules). E.g. for the following grammar:
				_ := stmt
				_ := expr
				stmt := expr
				expr := ID
				the input "ID" can be parsed as a "stmt" or an "expr". If we use a unified state `{ _ := . stmt | _ := . expr }` to start, we will have a new type of conflict (accept/reduce). I don't think we want to handle this kind of new conflict. There are some options:
				- carefully manage the `_` rules to make sure no such conflicts happen;
				- introduce a new augmented grammar `__ := _` (the accept/reduce conflict would become a reduce/reduce conflict);
				- use a separate start state per `_` rule, with callers passing a target nonterminal to the parser.
				Since there are some uncertain bits here, I'd prefer not to make further changes in this patch (but I'm happy to add a ruleID payload for the accept action). WDYT?
				LHS.Act == RHS.Act;
				}
				};
				} // namespace llvm

				namespace clang {
				namespace syntax {
				namespace pseudo {

				class LRTable::Builder {
				public:
				bool insert(Entry E) { return Entries.insert(std::move(E)).second; }
				LRTable build(const GrammarTable &GT) && {
				// E.g. given the following parsing table with 3 states and 3 terminals:
				//
				//           a      b    c
				// +-------+----+-------+-+
				// |state0 |    | s0,r0 | |
				// |state1 | acc|       | |
				// |state2 |    | r1    | |
				// +-------+----+-------+-+
				//
				// The final LRTable:
				//  - TerminalOffset: [a] = 0, [b] = 1, [c] = 4, [d] = 4 (d is a sentinel)
				//  -  States:   [ 1,   0,  0,  2]
				//    Actions:   [ acc, s0, r0, r1]
				//                 ~~~ corresponding range for terminal a
				//                      ~~~~~~~~~~ corresponding range for terminal b
				// First step, we sort all entries by (Symbol, State, Action).
				std::vector<Entry> Sorted(Entries.begin(), Entries.end());
				llvm::sort(Sorted, [](const Entry &L, const Entry &R) {
				return std::forward_as_tuple(L.Symbol, L.State, L.Act.opaque()) <
				std::forward_as_tuple(R.Symbol, R.State, R.Act.opaque());
				});

				LRTable Table;
				Table.Actions.reserve(Sorted.size());
				Table.States.reserve(Sorted.size());
				// We are good to finalize the States and Actions.
				for (const auto &E : Sorted) {
				Table.Actions.push_back(E.Act);
				Table.States.push_back(E.State);
				}
				// Initialize the terminal and nonterminal idx, all ranges are empty by
				// default.
				Table.TerminalOffset = std::vector<uint32_t>(GT.Terminals.size() + 1, 0);
				Table.NontermOffset = std::vector<uint32_t>(GT.Nonterminals.size() + 1, 0);
				size_t SortedIndex = 0;
				for (SymbolID NonterminalID = 0; NonterminalID < Table.NontermOffset.size();
				++NonterminalID) {
				Table.NontermOffset[NonterminalID] = SortedIndex;
				while (SortedIndex < Sorted.size() &&
				Sorted[SortedIndex].Symbol == NonterminalID)
				++SortedIndex;
				}
				for (size_t Terminal = 0; Terminal < Table.TerminalOffset.size();
				++Terminal) {
				Table.TerminalOffset[Terminal] = SortedIndex;
				while (SortedIndex < Sorted.size() &&
				Sorted[SortedIndex].Symbol ==
				tokenSymbol(static_cast<tok::TokenKind>(Terminal)))
				++SortedIndex;
				}
				return Table;
				}

				private:
				llvm::DenseSet<Entry> Entries;
				};

				LRTable LRTable::buildForTests(const GrammarTable &GT,
				llvm::ArrayRef<Entry> Entries) {
				Builder Build;
				for (const Entry &E : Entries)
				Build.insert(E);
				return std::move(Build).build(GT);
				}

				LRTable LRTable::buildSLR(const Grammar &G) {
				Builder Build;
				auto Graph = LRGraph::buildLR0(G);
				for (const auto &T : Graph.edges()) {
				Action Act = isToken(T.Label) ? Action::shift(T.Dst) : Action::goTo(T.Dst);
				Build.insert({T.Src, T.Label, Act});
				}
				assert(Graph.states().size() <= (1 << StateBits) &&
				"Graph states exceed the maximum limit!");
				auto FollowSets = followSets(G);
				for (StateID SID = 0; SID < Graph.states().size(); ++SID) {
				for (const Item &I : Graph.states()[SID].Items) {
				// If we've just parsed the start symbol, we can accept the input.
				if (G.lookupRule(I.rule()).Target == G.startSymbol() && !I.hasNext()) {
				Build.insert({SID, tokenSymbol(tok::eof), Action::accept(I.rule())});
				continue;
				}
				if (!I.hasNext()) {
				// If we've reached the end of a rule A := ..., then we can reduce if
				// the next token is in the follow set of A.
				for (SymbolID Follow : FollowSets[G.lookupRule(I.rule()).Target]) {
				assert(isToken(Follow));
				Build.insert({SID, Follow, Action::reduce(I.rule())});
				}
				}
				}
				}
				return std::move(Build).build(G.table());
				}

				} // namespace pseudo
				} // namespace syntax
				} // namespace clang
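
The layout step in `Builder::build` (sort by symbol, then state, then action, and record where each symbol's range begins) can be sketched with standard containers. This is a minimal sketch under assumptions: `ToyEntry`, `ToyTable`, `buildTable` and `exampleEntries` are hypothetical names, there is one symbol space instead of separate terminal and nonterminal offsets, and actions are strings, so ties are broken lexicographically rather than by `Act.opaque()`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical stand-ins for LRTable::Entry and the finished table.
struct ToyEntry {
  std::uint16_t State;
  std::uint16_t Symbol;
  std::string Act;
};

struct ToyTable {
  std::vector<std::uint32_t> Offset; // NumSymbols + 1 slots; last is a sentinel
  std::vector<std::uint16_t> States;
  std::vector<std::string> Actions;
};

ToyTable buildTable(std::vector<ToyEntry> Entries, std::size_t NumSymbols) {
  // Group by symbol first, so each symbol owns one contiguous range that is
  // sorted by state.
  std::sort(Entries.begin(), Entries.end(),
            [](const ToyEntry &L, const ToyEntry &R) {
              return std::tie(L.Symbol, L.State, L.Act) <
                     std::tie(R.Symbol, R.State, R.Act);
            });
  ToyTable T;
  // Loop over the keys (symbols), not the entries, so symbols without any
  // entry still get a well-defined, empty range.
  T.Offset.assign(NumSymbols + 1, 0);
  std::size_t Pos = 0;
  for (std::size_t Sym = 0; Sym < NumSymbols; ++Sym) {
    T.Offset[Sym] = static_cast<std::uint32_t>(Pos);
    while (Pos < Entries.size() && Entries[Pos].Symbol == Sym)
      ++Pos;
  }
  T.Offset[NumSymbols] = static_cast<std::uint32_t>(Pos);
  assert(Pos == Entries.size() && "entry with out-of-range symbol");
  for (const ToyEntry &E : Entries) {
    T.States.push_back(E.State);
    T.Actions.push_back(E.Act);
  }
  return T;
}

// Entries from the Builder comment: symbols a = 0, b = 1, c = 2.
std::vector<ToyEntry> exampleEntries() {
  return {{0, 1, "s0"}, {0, 1, "r0"}, {1, 0, "acc"}, {2, 1, "r1"}};
}
```

Calling `buildTable(exampleEntries(), 3)` gives offsets 0, 1, 4, 4 and states 1, 0, 0, 2, matching the worked example in the comment. The order of `s0` and `r0` within state 0 depends on the tie-breaker, which in this sketch is string comparison.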

clang/test/Syntax/check-cxx-bnf.test

	// verify clang/lib/Tooling/Syntax/Pseudo/cxx.bnf			// verify clang/lib/Tooling/Syntax/Pseudo/cxx.bnf
	// RUN: clang-pseudo -check-grammar=%cxx-bnf-file			// RUN: clang-pseudo -grammar=%cxx-bnf-file

clang/test/Syntax/lr-build-basic.test

This file was added.

				_ := expr
				expr := IDENTIFIER

				# RUN: clang-pseudo -grammar %s -print-graph | FileCheck %s --check-prefix=GRAPH
				# GRAPH: States:
				# GRAPH-NEXT: State 0
				# GRAPH-NEXT: _ := • expr
				# GRAPH-NEXT: expr := • IDENTIFIER
				# GRAPH-NEXT: State 1
				# GRAPH-NEXT: _ := expr •
				# GRAPH-NEXT: State 2
				# GRAPH-NEXT: expr := IDENTIFIER •
				# GRAPH-NEXT: 0 ->[expr] 1
				# GRAPH-NEXT: 0 ->[IDENTIFIER] 2

				# RUN: clang-pseudo -grammar %s -print-table | FileCheck %s --check-prefix=TABLE
				# TABLE: LRTable:
				# TABLE-NEXT: State 0
				# TABLE-NEXT: 'IDENTIFIER': shift state 2
				# TABLE-NEXT: 'expr': go to state 1
				# TABLE-NEXT: State 1
				# TABLE-NEXT: 'EOF': accept
				# TABLE-NEXT: State 2
				# TABLE-NEXT: 'EOF': reduce by rule 1 'expr := IDENTIFIER'
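
To show how a parser would consume the table printed above, here is a small deterministic shift/reduce driver with that table hard-coded. It is only a sketch: `ToySymbol`, `ToyAction`, `actionFor`, `goToState` and `parse` are hypothetical names, and the intended client in the review is a GLR parser that follows several actions per cell, which this single-action loop does not attempt.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum ToySymbol { Identifier, Eof, Expr };

struct ToyAction {
  enum Kind { Error, Shift, Reduce, Accept };
  Kind K;
  int Value; // target state for Shift, rule ID for Reduce
};

// Terminal columns of the table in lr-build-basic.test.
ToyAction actionFor(int State, ToySymbol Tok) {
  if (State == 0 && Tok == Identifier)
    return {ToyAction::Shift, 2};
  if (State == 1 && Tok == Eof)
    return {ToyAction::Accept, 0};
  if (State == 2 && Tok == Eof)
    return {ToyAction::Reduce, 1}; // rule 1: expr := IDENTIFIER
  return {ToyAction::Error, 0};
}

// Nonterminal column of the same table; -1 means "no entry".
int goToState(int State, ToySymbol Nonterminal) {
  return (State == 0 && Nonterminal == Expr) ? 1 : -1;
}

// Tokens are expected to end with Eof.
bool parse(const std::vector<ToySymbol> &Tokens) {
  std::vector<int> Stack = {0};
  std::size_t Pos = 0;
  while (Pos < Tokens.size()) {
    ToyAction A = actionFor(Stack.back(), Tokens[Pos]);
    switch (A.K) {
    case ToyAction::Shift:
      Stack.push_back(A.Value);
      ++Pos;
      break;
    case ToyAction::Reduce: {
      // Only rule 1 is ever reduced here; its body has one symbol, so pop one
      // state and follow the go-to entry for its target, expr.
      Stack.pop_back();
      int Next = goToState(Stack.back(), Expr);
      if (Next < 0)
        return false;
      Stack.push_back(Next);
      break;
    }
    case ToyAction::Accept:
      return true;
    case ToyAction::Error:
      return false;
    }
  }
  return false; // ran out of tokens without reaching Accept
}
```

For the input `IDENTIFIER EOF` the driver shifts to state 2, reduces by rule 1, goes to state 1 and accepts on `EOF`.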

clang/test/Syntax/lr-build-conflicts.test

This file was added.

				_ := expr
				expr := expr - expr # S/R conflict at state 4 on '-' token
				expr := IDENTIFIER

				# RUN: clang-pseudo -grammar %s -print-graph | FileCheck %s --check-prefix=GRAPH
				# GRAPH: States
				# GRAPH-NEXT: State 0
				# GRAPH-NEXT: _ := • expr
				# GRAPH-NEXT: expr := • expr - expr
				# GRAPH-NEXT: expr := • IDENTIFIER
				# GRAPH-NEXT: State 1
				# GRAPH-NEXT: _ := expr •
				# GRAPH-NEXT: expr := expr • - expr
				# GRAPH-NEXT: State 2
				# GRAPH-NEXT: expr := IDENTIFIER •
				# GRAPH-NEXT: State 3
				# GRAPH-NEXT: expr := • expr - expr
				# GRAPH-NEXT: expr := expr - • expr
				# GRAPH-NEXT: expr := • IDENTIFIER
				# GRAPH-NEXT: State 4
				# GRAPH-NEXT: expr := expr - expr •
				# GRAPH-NEXT: expr := expr • - expr
				# GRAPH-NEXT: 0 ->[expr] 1
				# GRAPH-NEXT: 0 ->[IDENTIFIER] 2
				# GRAPH-NEXT: 1 ->[-] 3
				# GRAPH-NEXT: 3 ->[expr] 4
				# GRAPH-NEXT: 3 ->[IDENTIFIER] 2
				# GRAPH-NEXT: 4 ->[-] 3

				# RUN: clang-pseudo -grammar %s -print-table | FileCheck %s --check-prefix=TABLE
				# TABLE: LRTable:
				# TABLE-NEXT: State 0
				# TABLE-NEXT: 'IDENTIFIER': shift state 2
				# TABLE-NEXT: 'expr': go to state 1
				# TABLE-NEXT: State 1
				# TABLE-NEXT: 'EOF': accept
				# TABLE-NEXT: '-': shift state 3
				# TABLE-NEXT: State 2
				# TABLE-NEXT: 'EOF': reduce by rule 1 'expr := IDENTIFIER'
				# TABLE-NEXT: '-': reduce by rule 1 'expr := IDENTIFIER'
				# TABLE-NEXT: State 3
				# TABLE-NEXT: 'IDENTIFIER': shift state 2
				# TABLE-NEXT: 'expr': go to state 4
				# TABLE-NEXT: State 4
				# TABLE-NEXT: 'EOF': reduce by rule 2 'expr := expr - expr'
				# TABLE-NEXT: '-': shift state 3
				# TABLE-NEXT: '-': reduce by rule 2 'expr := expr - expr'
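
The shift/reduce conflict in state 4 shows up in the table as a cell holding more than one action. A hedged sketch of spotting such cells follows; `ToyCell`, `CellKey`, `findConflicts` and `conflictTable` are hypothetical names, and the rows are copied by hand from the expected terminal actions above rather than produced by `LRTable`.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One (state, terminal, action) row of a dumped table.
struct ToyCell {
  int State;
  std::string Terminal;
  std::string Action;
};

using CellKey = std::pair<int, std::string>;

// Returns every (state, terminal) cell that holds more than one action.
std::vector<CellKey> findConflicts(const std::vector<ToyCell> &Table) {
  std::map<CellKey, int> Count;
  for (const ToyCell &C : Table)
    ++Count[CellKey(C.State, C.Terminal)];
  std::vector<CellKey> Conflicts;
  for (const auto &KV : Count)
    if (KV.second > 1)
      Conflicts.push_back(KV.first);
  return Conflicts;
}

// Terminal actions from lr-build-conflicts.test.
std::vector<ToyCell> conflictTable() {
  return {
      {0, "IDENTIFIER", "shift 2"}, {1, "EOF", "accept"},
      {1, "-", "shift 3"},          {2, "EOF", "reduce 1"},
      {2, "-", "reduce 1"},         {3, "IDENTIFIER", "shift 2"},
      {4, "EOF", "reduce 2"},       {4, "-", "shift 3"},
      {4, "-", "reduce 2"},
  };
}
```

On this table the only reported cell is state 4 on '-', matching the "S/R conflict at state 4 on '-' token" note in the grammar.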

clang/tools/clang-pseudo/ClangPseudo.cpp

	//===-- ClangPseudo.cpp - Clang pseudo parser tool ------------------------===//			//===-- ClangPseudo.cpp - Clang pseudo parser tool ------------------------===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#include "clang/Tooling/Syntax/Pseudo/Grammar.h"			#include "clang/Tooling/Syntax/Pseudo/Grammar.h"
				#include "clang/Tooling/Syntax/Pseudo/LRGraph.h"
				#include "clang/Tooling/Syntax/Pseudo/LRTable.h"
	#include "llvm/ADT/StringExtras.h"			#include "llvm/ADT/StringExtras.h"
	#include "llvm/Support/CommandLine.h"			#include "llvm/Support/CommandLine.h"
	#include "llvm/Support/FormatVariadic.h"			#include "llvm/Support/FormatVariadic.h"
	#include "llvm/Support/MemoryBuffer.h"			#include "llvm/Support/MemoryBuffer.h"

	using clang::syntax::pseudo::Grammar;			using clang::syntax::pseudo::Grammar;
	using llvm::cl::desc;			using llvm::cl::desc;
	using llvm::cl::init;			using llvm::cl::init;
	using llvm::cl::opt;			using llvm::cl::opt;

	static opt<std::string>			static opt<std::string>
	CheckGrammar("check-grammar", desc("Parse and check a BNF grammar file."),			Grammar("grammar", desc("Parse and check a BNF grammar file."), init(""));
	init(""));			static opt<bool> PrintGraph("print-graph",
				desc("Print the LR graph for the grammar"));
				static opt<bool> PrintTable("print-table",
				desc("Print the LR table for the grammar"));

	int main(int argc, char *argv[]) {			static std::string readOrDie(llvm::StringRef Path) {
	llvm::cl::ParseCommandLineOptions(argc, argv, "");

	if (CheckGrammar.getNumOccurrences()) {
	llvm::ErrorOr<std::unique_ptr<llvm::MemoryBuffer>> Text =			llvm::ErrorOr<std::unique_ptr<llvm::MemoryBuffer>> Text =
	llvm::MemoryBuffer::getFile(CheckGrammar);			llvm::MemoryBuffer::getFile(Path);
	if (std::error_code EC = Text.getError()) {			if (std::error_code EC = Text.getError()) {
	llvm::errs() << "Error: can't read grammar file '" << CheckGrammar			llvm::errs() << "Error: can't read file '" << Path << "': " << EC.message()
	<< "': " << EC.message() << "\n";			<< "\n";
	return 1;			::exit(1);
	}			}
				return Text.get()->getBuffer().str();
				}

				int main(int argc, char *argv[]) {
				llvm::cl::ParseCommandLineOptions(argc, argv, "");

				if (Grammar.getNumOccurrences()) {
				std::string Text = readOrDie(Grammar);
	std::vector<std::string> Diags;			std::vector<std::string> Diags;
	auto RSpecs = Grammar::parseBNF(Text.get()->getBuffer(), Diags);			auto G = Grammar::parseBNF(Text, Diags);

	if (!Diags.empty()) {			if (!Diags.empty()) {
	llvm::errs() << llvm::join(Diags, "\n");			llvm::errs() << llvm::join(Diags, "\n");
	return 2;			return 2;
	}			}
	llvm::errs() << llvm::formatv("grammar file {0} is parsed successfully\n",			llvm::outs() << llvm::formatv("grammar file {0} is parsed successfully\n",
				sammccall [Not Done]: I think that we should also have flags to:
				- dump the grammar
				- dump the LR table
	CheckGrammar);			Grammar);
				if (PrintGraph)
				llvm::outs() << clang::syntax::pseudo::LRGraph::buildLR0(*G).dumpForTests(
				*G);
				if (PrintTable)
				llvm::outs() << clang::syntax::pseudo::LRTable::buildSLR(*G).dumpForTests(
				*G);
	return 0;			return 0;
	}			}

	return 0;			return 0;
	}			}

clang/unittests/Tooling/Syntax/Pseudo/CMakeLists.txt

	set(LLVM_LINK_COMPONENTS			set(LLVM_LINK_COMPONENTS
	Support			Support
	)			)

	add_clang_unittest(ClangPseudoTests			add_clang_unittest(ClangPseudoTests
	GrammarTest.cpp			GrammarTest.cpp
	LRGraphTest.cpp			LRTableTest.cpp
	)			)

	clang_target_link_libraries(ClangPseudoTests			clang_target_link_libraries(ClangPseudoTests
	PRIVATE			PRIVATE
	clangBasic			clangBasic
	clangLex			clangLex
	clangToolingSyntaxPseudo			clangToolingSyntaxPseudo
	clangTesting			clangTesting
	)			)

	target_link_libraries(ClangPseudoTests			target_link_libraries(ClangPseudoTests
	PRIVATE			PRIVATE
	LLVMTestingSupport			LLVMTestingSupport
	)			)

clang/unittests/Tooling/Syntax/Pseudo/LRGraphTest.cpp

This file was deleted.

	//===--- LRGraphTest.cpp - LRGraph tests -------------------------- C++--===//
	//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//
	//===----------------------------------------------------------------------===//

	#include "clang/Tooling/Syntax/Pseudo/LRGraph.h"
	#include "gmock/gmock.h"
	#include "gtest/gtest.h"
	#include <memory>

	namespace clang {
	namespace syntax {
	namespace pseudo {
	namespace {

	TEST(LRGraph, Build) {
	struct TestCase {
	llvm::StringRef BNF;
	llvm::StringRef ExpectedStates;
	};

	TestCase Cases[] = {{
	R"bnf(
	_ := expr
	expr := IDENTIFIER
	)bnf",
	R"(States:
	State 0
	_ := • expr
	expr := • IDENTIFIER
	State 1
	_ := expr •
	State 2
	expr := IDENTIFIER •
	0 ->[expr] 1
	0 ->[IDENTIFIER] 2
	)"},
	{// A grammar with a S/R conflict in SLR table:
	// (id-id)-id, or id-(id-id).
	R"bnf(
	_ := expr
	expr := expr - expr # S/R conflict at state 4 on '-' token
	expr := IDENTIFIER
	)bnf",
	R"(States:
	State 0
	_ := • expr
	expr := • expr - expr
	expr := • IDENTIFIER
	State 1
	_ := expr •
	expr := expr • - expr
	State 2
	expr := IDENTIFIER •
	State 3
	expr := • expr - expr
	expr := expr - • expr
	expr := • IDENTIFIER
	State 4
	expr := expr - expr •
	expr := expr • - expr
	0 ->[expr] 1
	0 ->[IDENTIFIER] 2
	1 ->[-] 3
	3 ->[expr] 4
	3 ->[IDENTIFIER] 2
	4 ->[-] 3
	)"}};
	for (const auto &C : Cases) {
	std::vector<std::string> Diags;
	auto G = Grammar::parseBNF(C.BNF, Diags);
	ASSERT_THAT(Diags, testing::IsEmpty());
	auto LR0 = LRGraph::buildLR0(*G);
	EXPECT_EQ(LR0.dumpForTests(*G), C.ExpectedStates);
	}
	}

	} // namespace
	} // namespace pseudo
	} // namespace syntax
	} // namespace clang

clang/unittests/Tooling/Syntax/Pseudo/LRTableTest.cpp

This file was added.

				//===--- LRTableTest.cpp - ---------------------------------------- C++--===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "clang/Tooling/Syntax/Pseudo/LRTable.h"
				#include "clang/Basic/TokenKinds.h"
				#include "clang/Tooling/Syntax/Pseudo/Grammar.h"
				#include "gmock/gmock.h"
				#include "gtest/gtest.h"
				#include <vector>

				namespace clang {
				namespace syntax {
				namespace pseudo {
				namespace {

				using testing::IsEmpty;
				using testing::UnorderedElementsAre;
				using Action = LRTable::Action;

				TEST(LRTable, Builder) {
				GrammarTable GTable;

				//          eof   semi   ...
				// +-------+----+-------+---
				// |state0 |    | s0,r0 |...
				// |state1 | acc|       |...
				// |state2 |    | r1    |...
				// +-------+----+-------+---
				std::vector<LRTable::Entry> Entries = {
				{/* State */ 0, tokenSymbol(tok::semi), Action::shift(0)},
				{/* State */ 0, tokenSymbol(tok::semi), Action::reduce(0)},
				{/* State */ 1, tokenSymbol(tok::eof), Action::accept(2)},
				{/* State */ 2, tokenSymbol(tok::semi), Action::reduce(1)}};
				GrammarTable GT;
				LRTable T = LRTable::buildForTests(GT, Entries);
				EXPECT_THAT(T.find(0, tokenSymbol(tok::eof)), IsEmpty());
				EXPECT_THAT(T.find(0, tokenSymbol(tok::semi)),
				UnorderedElementsAre(Action::shift(0), Action::reduce(0)));
				EXPECT_THAT(T.find(1, tokenSymbol(tok::eof)),
				UnorderedElementsAre(Action::accept(2)));
				EXPECT_THAT(T.find(1, tokenSymbol(tok::semi)), IsEmpty());
				EXPECT_THAT(T.find(2, tokenSymbol(tok::semi)),
				UnorderedElementsAre(Action::reduce(1)));
				// Verify the behavior for other terminals that have no actions.
				EXPECT_THAT(T.find(2, tokenSymbol(tok::kw_int)), IsEmpty());
				}

				} // namespace
				} // namespace pseudo
				} // namespace syntax
				} // namespace clang


Diff 410729
