
[pseudo] Implement LRGraph

Authored by hokein on Feb 7 2022, 11:36 AM.



LRGraph is the key component of the clang pseudo parser: a
deterministic handle-finding finite-state machine, which is used to
generate the LR parsing table.

Separate from

Diff Detail

Event Timeline

hokein created this revision.Feb 7 2022, 11:36 AM
hokein requested review of this revision.Feb 7 2022, 11:36 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 7 2022, 11:36 AM
sammccall accepted this revision.Feb 8 2022, 2:01 AM

LG, thank you!
Bunch of nits, up to you what you want to do with them.


I think it'd be useful to be a little more concrete here than "find"...

and collect the right-hand side of a production rule (called handle) on top
of the stack, then replace (reduce) the handle with the nonterminal defined
by the production rule.
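The handle-replacing step described above can be made concrete with a small toy (this is an illustration only, not the pseudoparser's API; Rule, tryReduce and the string symbols are invented for the sketch): when the symbols on top of the stack match a rule's right-hand side, pop them and push the rule's nonterminal.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Illustrative rule: Target := Body, e.g. A := x y.
struct Rule {
  std::string Target;            // nonterminal on the left-hand side
  std::vector<std::string> Body; // right-hand side (the "handle")
};

// If the top of the stack matches R's right-hand side, replace it
// (reduce) with R's nonterminal and return true.
bool tryReduce(std::vector<std::string> &Stack, const Rule &R) {
  if (Stack.size() < R.Body.size())
    return false;
  if (!std::equal(R.Body.begin(), R.Body.end(),
                  Stack.end() - R.Body.size()))
    return false;
  Stack.resize(Stack.size() - R.Body.size()); // pop the handle
  Stack.push_back(R.Target);                  // push the nonterminal
  return true;
}
```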

The lack of spaces on the RHS is inconsistent with the debug output, and any examples with real grammars (since we need spaces to split up words).

Can we use A := x y instead?
Or possibly A := X Y which makes the boundaries more obvious when it appears in-line with text.

(I don't have an opinion about A := X . Y vs A := X.Y (dot *replaces* space), but again we should pick one consistent with the debug output.)


I think the calls to these constructors could be a little clearer as factories that describe their purpose:


static Item start(RuleID ID, const Grammar &G);
static Item sentinel(uint8_t ID);
Item Item::advance() const;

I think that would cover everything?


this alias is used only once in the header, for Items - it doesn't really introduce a new public concept that's distinct from that field.

I think it would be clearer to write it as std::vector<Item> there, and move the alias to the implementation file.


call this "SortByNextSymbol"? or something?

As it is, it's hard at a glance to understand why we have both operator< and LessItem.


This comparator is pure implementation detail, and can go in the cpp file.
(Still fine to reference in the comment, but I think "in a canonical order" is enough)


this seems to compare items equal if both are dot-at-the-end.

In general the cascade/priority/completeness of comparisons isn't really easy to spot. Consider structuring like:

if (L.hasNext() && R.hasNext() && L.next(G) != R.next(G))
  return L.next(G) < R.next(G);
if (L.hasNext() != R.hasNext())
  return L.hasNext() < R.hasNext(); // trailing dot is minimum
return L < R;
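A runnable version of that comparison structure might look like the sketch below. Item, hasNext and next are stand-ins for the patch's real types (and the grammar argument is dropped for self-containment); the point is the explicit cascade: next symbol first, then dot-at-the-end as the minimum, then a tie-break.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for an LR(0) item: a rule's right-hand side plus a dot position.
struct Item {
  std::vector<int> RHS;  // symbols on the rule's right-hand side
  uint8_t DotPos = 0;

  bool hasNext() const { return DotPos < RHS.size(); }
  int next() const { return RHS[DotPos]; } // symbol right after the dot
  // Arbitrary tie-break so the order is total.
  bool operator<(const Item &R) const { return DotPos < R.DotPos; }
};

// Order items by the symbol after the dot; dot-at-the-end sorts first.
bool lessByNextSymbol(const Item &L, const Item &R) {
  if (L.hasNext() && R.hasNext() && L.next() != R.next())
    return L.next() < R.next();
  if (L.hasNext() != R.hasNext())
    return L.hasNext() < R.hasNext(); // trailing dot is minimum
  return L < R;
}
```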

IIUC the data structure used here is inherently LR0, rather than the choice of how we build it.
(Vs in the LRTable where we could come up with data representing a full LR(1) parser without changing the structure).

So I think this function doesn't need LR0 in its name (but the class could be LR0Graph if you like)?


This took me a while to understand: you're reusing the ItemSet (passed by value) as the work queue (well, stack) and it's already initialized to the right elements.

This is clever, but please add a comment. Maybe even rename IS->Queue


nit: auto -> Item?
(Generally if the types aren't hard to write or read, it seems nice to include them)


nit: I find the spacing (blank lines) here a bit confusing: this seems part of the previous "paragraph" rather than the following one.

Could use a comment like "Is there a nonterminal next() to expand?"


You still have the IS array sitting there, you could reuse that again and avoid the allocation if you didn't expand the set too much :-)


Maybe call this Item.advance(), to avoid exposing the constructor


Since this is a single-line output, it seems more flexible to put the caller in charge of adding a newline if they want one?

(This would be inconsistent with State::dump of course, but I think that's OK. Maybe we should work out a naming convention at some point)


Hardcoding the indentation seems a bit special-purpose. Make the indent level an optional parameter to State::dump() and use OS.indent()?
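A standard-library analogue of that suggestion (this sketch uses std::ostream and a string fill in place of llvm::raw_ostream::indent, and the State shape is invented): the caller passes the indent level instead of the method hardcoding it.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Sketch: dump takes the indent level as an optional parameter, so
// callers control nesting rather than the method hardcoding spaces.
struct State {
  std::vector<std::string> Items;
  void dump(std::ostream &OS, unsigned Indent = 0) const {
    for (const auto &I : Items)
      OS << std::string(Indent, ' ') << I << "\n";
  }
};
```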


nit: more spaces would make this more readable

{0} ->[{1}] {2}
or even
{0} --[{1}]-> {2}?


This really feels like a dubious optimization to me as it's *incorrect* if we ever get a hash collision, and the kernel ItemSet is much smaller than States which we're already storing. This seems like it must only save 20% or something of overall size?

In the other review thread you compared the size of DenseMap<hash_code, size_t> vs DenseMap<ItemSet, size_t>, but really the total size including States seems like a more useful comparison.

This revision is now accepted and ready to land.Feb 8 2022, 2:01 AM
hokein updated this revision to Diff 407092.Feb 9 2022, 1:58 AM
hokein marked 16 inline comments as done.

address review comments.


At the moment we build an LR(0) automaton, but the LRGraph itself is a generic data structure which could also model the LR(1) automaton (it is a matter of how we define the states/nodes in the graph). Say in the future we want to build a full LR(1); then we could easily add another method LRGraph::buildLR(1), so the current names seem more reasonable to me.


ok, changed to DenseMap<ItemSet, size_t> instead. The size of States is ~215KB, so it's 15% for DenseMap<hash_code> vs 28% for DenseMap<ItemSet>; the difference is not huge.

This revision was landed with ongoing or failed builds.Feb 9 2022, 2:20 AM
This revision was automatically updated to reflect the committed changes.
alextsao1999 added inline comments.

Can we add LookaheadSymbol here to implement LR(1)?

Herald added a project: Restricted Project. · View Herald TranscriptMar 5 2022, 2:21 AM
hokein added inline comments.Mar 7 2022, 1:10 AM

we could do that. However, we don't have a plan to implement LR(1) yet; we use SLR(1). (Though LR(1) is more powerful than SLR(1), a typical deterministic LR(1) parser still cannot handle the C++ grammar; we need a "general" parser, GLR, which can handle arbitrary context-free grammars.)

alextsao1999 added inline comments.Mar 7 2022, 5:48 AM

Thanks for your answer! Oh, I know some GLR parsers are based on LR(1) or LALR, so I thought our GLR parser was based on LR(1) as well. I'm trying to keep up with your train of thought :)

sammccall added inline comments.Mar 7 2022, 6:49 AM

Yeah, GLR changes the tradeoff between more sophisticated and simpler parsers (LR(1) > LALR > SLR(1) > LR(0)).

Normally the sophisticated parsers are able to handle grammars/languages that the simple ones can't, by avoiding action conflicts. So the value is very high.

However with GLR we can handle action conflicts by branching, so the value is "only" avoiding the performance hit of chasing branches that don't go anywhere.

So it didn't really seem worth the extra implementation complexity (or extra size of the in-memory grammar tables!) to use a more powerful parser than SLR(1). Maybe we should even have given LR(0) more thought :-)

alextsao1999 added inline comments.Mar 7 2022, 9:37 AM

Yes, agreed. SLR can reduce memory usage, but it can't handle operator precedence. With the help of GLR, we can resolve the problem of operator precedence by choosing one branch. Thanks, I got it!