
[pseudo][WIP] Build Ambiguous forest node in the GLR Parser.
Needs ReviewPublic

Authored by hokein on Mar 10 2022, 3:07 AM.



A forest node is by design immutable. To create an ambiguous node, we
have to know all of its alternatives in advance.

To achieve that, we perform all reductions in a careful order (see the
code comments for details), so that we can gather completed alternatives
as a batch and process them in a single pass.

E.g. consider the following grammar:

TU := stmt
TU := expr
stmt := expr
expr := ID
stmt := ID

// Ambiguous stmt forest node:
//     stmt (ambiguous)
//    /     \
//   /      stmt
//   |       |
//  stmt   expr
//    \     /
//       ID

The ambiguous stmt node is built during a single reduction phase, in which we perform three reductions:

  1. expr := ID
  2. stmt := ID
  3. stmt := expr (enabled after 1 is performed)

We perform them in batches, in the order {1}, {2, 3}.
When processing the batch {2, 3}, where the ambiguity arises, we build an
ambiguous forest node for stmt.

Based on

Diff Detail

Event Timeline

hokein created this revision.Mar 10 2022, 3:07 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 10 2022, 3:07 AM
hokein requested review of this revision.Mar 10 2022, 3:07 AM

The implementation (performReduction) is awkward at the moment, but it is complete and should give a global view of the algorithm.

alextsao1999 added inline comments.Mar 14 2022, 11:18 AM

Maybe we can make the goto step clearer? E.g. run performGoto after every GLR reduction.

A bunch of comments, but probably the most important one is that the inner loop feels uncomfortably chaotic:

  • there's a huge function with a lot of stuff in scope and referenced; it's unclear what the data flow is
  • there are very many (hopefully unnecessarily many) concepts and data structures
  • it's not really clear what the motivation for the concepts is; the comments mostly describe how the algorithm uses them

I have a suspicion that all the allocation and indirection limits performance too, but at the moment that's a secondary concern.

I realize the minimal complexity is probably unavoidably high here but it'd be good if we can sketch out the data structures from first principles and try to find a way to simplify them.


As discussed offline, the algorithm is simplest if the RuleIDs themselves reflect the topological order in which we want to reduce them.

We require the RuleIDs to be grouped by the SymbolID of their LHS, but I don't think we have much of a requirement on SymbolIDs themselves. (Currently they're alphabetical and we binary-search for _ in the grammar constructor, but not much else.)

So I think we can just sort the symbols with this topo order when constructing the grammar.
(This is fairly self-contained, probably affects dumps from tests, could be a separate patch if you like)


... can be reduced using multiple rules
(avoid relying on subtleties of what the identity of a "reduce action" is)


base hasn't been defined. In general, you've only mentioned the special cases here, but not the basic case. (case 0?)


this is confusing because below you define "ReduceAction" to be specific to a base.

I'd say "when a stack head can be reduced by a rule, but there are multiple possible bases for the reduction (due to a previous merge), ..."


This comment seems a bit confusing: it suggests that all reductions from a state are naturally "the same reduction" unless we're explicitly splitting them out for different paths here.

However this doesn't really match the grammatical meaning of the word "reduction", it's the act of reducing some symbols using a rule. If the symbols or the rule change, it's a different reduction.

Maybe "A reduction is characterized by the symbols on the stack being reduced and the rule used to transform them. There may be multiple reductions possible for the same rule, if some node near the top of the stack has multiple parents."


SmallVector: the size is at most the length of a rule's RHS.


If we can use ruleid instead of grammar table, the comparator becomes stateless and this code is way easier to (re)organize :-)


nit: usually early-exit, e.g. if (LBegin != RBegin) return LBegin < RBegin

As well as reducing nesting, it puts the higher-priority rules first.


This isn't sufficient given your documented definition of topological order: if A and B are incomparable (neither A:=B nor B:=A), then both A and B could have topological order 5, and we could sort the ReduceActions as ABAB and fail to pack the ambiguity.

We either need SymbolID as a second tiebreak or to bake the topological order into the symbolID/ruleid itself.


this is not equality of reduce actions, it's something else.

Maybe just SameRangeAndSymbol?


unmutable --> immutable


nit: I think this comment belongs above the definition of order rather than below it.


We haven't really explained *why* order is important to finding all alternatives.
The explanation here is useful, but still too far on the side of describing the code IMO.

As we reduce, new reductions may become available.
Initially, each stack head holds the terminal we just shifted.
-- expr -- + -- IDENT
We can only reduce by rules whose RHS end with this token.
After reducing by expr := IDENT we have an `expr` on the stack:
-- expr -- + -- expr
We can now reduce by expr := expr + expr.

If we reduce in arbitrary order until exhausted, we'd see all
possible reductions, but "related" reductions may not be
seen at the same time.
Reductions that interpret the same token range as the same
nonterminal should share a single Ambiguous forest node.
If they reach the same parse state, they share a GSS node too.
These nodes are immutable and so related reductions need
to happen in a batch.

So instead, we perform reductions in a careful order that
ensures all related reductions are visible at once.
A reduction of N tokens as symbol S can depend on:
  1) reducing the last M < N tokens as T, then S := ... T
  2) reducing the last N tokens as symbol T, then S := T
To handle 1), shorter reductions happen before longer ones.
To handle 2), we use the fact that S := T and T := S can't
both be possible (even transitively) in a valid grammar.
A topological order of the target symbols is used to ensure
that T is reduced before S for fixed N (in our example).

nit: capital AddToOrdered...




I think as previously discussed, I'd be happier if enumerateReducePath was an instance method and just directly wrote into OrderedReduceList (which can be a member), rather than passing around callbacks.

This seems like a tight loop to be using std::function


if OrderedReduceList was a member instead of a local here, we wouldn't need to move data from ReduceList into OrderedReduceList, we could just put it there in the first place


the overall function is too long, this next chunk "process a batch of reductions that produce the same symbol over the same token range" seems like a reasonable place to pull out a function.

We should introduce a name for this concept: maybe a reduction family?


I don't really like the complexity of the transient data structures we're building here: many hashtables and vectors for each reduction.


I'm not really sure precisely what "ambiguities", "equality" and "predecessors" mean in this comment.
Can we be more specific?


this doesn't make sense inside the loop, the batch has the same target symbol by definition


seems a bit wasteful we have to materialize a temporary vector to copy from just because our first one is in the wrong order! Can we have it be in the right order instead?


avoid same name for type & variable


so IIUC given we're reducing 3 symbols, head is Z, Z.parent is Y:

  • if Y has two parents X1, X2, then we may have multiple reduce paths/ReduceActions: [Z Y X1], [Z Y X2]
  • but if Y.parent is X and X has two parents W1 W2, then we have *one* reduce path [Z Y X] and two BaseInfos for W1 and W2.

why? why not rather say that there's just one ReduceAction concept and the Base is part of its identity, and it gets built by enumerateReducePath?


partition seems like a really important concept here, but isn't defined.


I'm not really sure what this chunk of commented-out code is trying to say


Again, you're talking about ambiguity, but haven't ever really defined it (apart from how we're going to represent parse ambiguity in the forest). Can we use more precise language? It's hard to understand what the purpose of this block is.


nit: comment just echoes the code


sure, but why?


we're iterating over BuiltForestNodes in order to... build forest nodes

I guess BuiltForestNodes should rather be SequenceNodes?

Why is it that we're grouping by GSS node here? My understanding was we wanted one forest node per (nonterminal, token range), but GSS nodes are finer-grained than that (e.g. they may be differentiated only by parse state, possibly by parent list too?)

sammccall added inline comments.Mar 23 2022, 4:43 PM

how can we be sure we're not creating duplicate forest nodes?

IIUC, it's possible to have the same forest nodes on top of several heads of the stack.
Distinct GSS nodes due to different states, but same forest nodes.

Then the states may have overlapping itemsets, and both allow the same reduction.
Here we unconditionally produce a forest sequence node for each ReduceAction, and we will have two ReduceActions with the same forest nodes on the stack.

Some notes before our meeting.

It does appear that it's possible to generate duplicate forest nodes this way, and AFAIK any method other than explicitly deduplicating using a map<(rule, rhsnodes), sequencenode> is going to have this problem.
The good news is:

  • the cache lifetime is local to the family (outer batch); collisions necessarily happen within a family. (We can make it a member and clear it, to reuse storage.)
  • I think we can form the forest nodes at the bottom of the enumerateReducePaths() call, and use them + base GSS node in place of the reduce path. (we never actually used internal gss path nodes, just the forest nodes + GSS base).
  • Now a reduction found by enumerateReducePaths() is identified by (base, sequence node, new state), which is cheap to compare/group in various ways

I feel like this should simplify subsequent steps, but need to think more