This is an archive of the discontinued LLVM Phabricator instance.

[pseudo] Implement the GLR parsing algorithm.
ClosedPublic

Authored by hokein on Mar 7 2022, 12:54 PM.

Details

Summary

This patch implements a standard GLR parsing algorithm, the core piece of the pseudoparser:

  • it parses preprocessed C++ code, currently it supports correct code only and parse them as a translation-unit;
  • it produces a forest which stores all possible trees in an efficient manner (only a single node being build for per (SymbolID, Token Range)); no disambiguation yet;

Diff Detail

Event Timeline

hokein created this revision.Mar 7 2022, 12:54 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 7 2022, 12:54 PM
Herald added subscribers: mgrang, mgorny. · View Herald Transcript
hokein requested review of this revision.Mar 7 2022, 12:54 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 7 2022, 12:54 PM
hokein added a comment.Mar 7 2022, 1:03 PM

This is an initial version, not unittests yet, but it should be good enough for high-level reviews.

Basically, it contains all key pieces of the GLR parser:

  • ForestForest: it is a DAG with compact nodes (8 bytes per node). It is not mutable by design. The code is mostly derived from our prototype, and might need some improvements.
  • Graph-structured stack: the GSS is a simple implementation -- only allocation for new nodes, and no deallocation for dead nodes. We should figure out whether it is worth the effort to implement a smart deallocation. The GSS only affects the peak memory usage during parsing, it can be thrown away after we build the forest. In addition, the forest node is stored in the graph node, rather than the edge (per our discussion, it felt more natural and better fit to our mental model to store forest nodes in edges, but I found that the implementation was awkward, and finally abandoned that).
  • Core GLR parsing algorithm: it should be in a good state for review, it is missing the bit of producing ambiguous forest-node (as we use a non-mutable forest, this'd require some careful ordering when performing reduce actions, the implementation is tricky and takes some complexity, plan to do it in a follow up patch).

Some misc questions:

  • is the current output of the forest ok? (I think in general it is ok)
  • Any ideas about testing & debugging? Verifying the output forest is a way to test the GLR parser, but it doesn't seem to be an ideal way if we want to inspect the internal states of the parser during parsing. The alternative is to define a logger interface, the parser can invoke it at some points, so that we can inject some observers into the parser, and can use the log messages for testing and debugging purposes (see the LLVM_DEBUG usage in GLRParser.cpp).
hokein updated this revision to Diff 413609.Mar 7 2022, 1:05 PM

some tweaks.

Nice! Some early comments, I haven't gotten deep into the GLR algorithm itself

clang/include/clang/Tooling/Syntax/Pseudo/Forest.h
9 ↗(On Diff #413609)

nit: i think we should reverse the emphasis: the one-sentence summary should say what this models (a set of possible parse trees) and then elaborate below that it's the output of GLR

12 ↗(On Diff #413609)

I think we should be a bit more verbose than "forest is a dag" since this is very confusing/surprising (at least to me).

Maybe

A parse forest represents a set of possible parse trees."
Despite the name, the data structure is a parse-tree-like DAG with a single root.
Multiple ways to parse the same tokens are represented as an Ambiguous node with the possible interpretations as children.
Common sub-parses are shared: if two interpretations both parse the token range "1 + 1" as expr := expr + expr, they will share a Sequence node representing this.
13 ↗(On Diff #413609)

nit: numberous -> numerous

13 ↗(On Diff #413609)

wording is a bit misleading here:

  • suggests that the primary virtue is being space efficient when really it eliminates other forms of redundancy too (like traversal)
  • calling these "subtrees" is a bit confusing since they're not trees (but represent sets of trees)
29 ↗(On Diff #413609)

We should say something more than this about node structure:

  • nodes represent ways to interpret a sequence of tokens
  • a node always interprets a fixed set of tokens as a fixed symbol kind
  • some nodes may have children, and have pointers to them
  • all nodes may have multiple parents, but do not know them
30 ↗(On Diff #413609)

I wonder if we should put these in a namespace forest::Node rather than giving them the "Forest" prefix?

46 ↗(On Diff #413609)

this sentence doesn't make sense

57 ↗(On Diff #413609)

maybe startIndex or startToken? Loc reminds me too much of SourceLocation...

sammccall added inline comments.Mar 7 2022, 4:00 PM
clang/include/clang/Tooling/Syntax/Pseudo/Forest.h
111 ↗(On Diff #413609)

it seems weird to have this state... can we replace init + terminal() with
ArrayRef<ForestNode> createTerminals(const TokenStream&)
and have the parser responsible for keeping track of pointers (as it does for other nodes)?

This would also make this fit the pure "arena" concept better.

I'm digging through notes on the prototype and I'm not convinced that my reasons for keeping track of these in the arena make sense.
(They were about ensuring equivalent nodes are shared between heuristic vs grammar parser, but I'm not sure why this only needs to be done for terminals, and there are other ways to do it)

clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h
114 ↗(On Diff #413609)

this function never exposes an interesting graph, as it's mostly empty before + after parse().
It's only really useful for working out how much memory is allocated.

If we were to drop this, the public interface of GLRParser would be a single function (so it need not be a public class) and Graph would disappear entirely!

With this in mind, it seems like we could replace this header file with something like:

struct ParseStats {
  unsigned GSSBytes = 0;
};
ForestNode *glrParse(const TokenStream &, const LRTable&, const Grammar&, ForestArena&, ParseStats *Stats = nullptr);

It feels like hiding the GSS might be an obstacle to debugging/testing.
However this really would need a story/API around how to interrupt the parser.
LLVM_DEBUG seems reasonable to me for now.

clang/lib/Tooling/Syntax/Pseudo/Forest.cpp
55 ↗(On Diff #413609)

commented out code

74 ↗(On Diff #413609)

a few of these comments have the wrong case on the names

clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp
33 ↗(On Diff #413609)

It actually currently looks like it would be at least as natural to have the parser operate on the sequence of terminal ForestNodes rather than on the Tokens themselves...

64 ↗(On Diff #413609)

Returning null when the input doesn't match the grammar is not at all what we want.
I know it's early, but I'm worried we're defining an API for the GLR parser that matches the textbook rather than the one we need. (Similar to concerns about the Accept action - we're going to accept any stream of tokens).

I think we need to plan a little bit how error recovery is going to work here:

  • do we plan to call this recursively inside brackets, or handle lazy-brackets in the GLR parser itself?
  • will sequence recovery (after , or between declarations) happen in the GLR parser or externally? in this top loop? how will parsing continue?
  • where would splitting heuristics happen (like interpreting x? = y? as an assignment expression, or foo(; bar; as two statements)
  • in the worst case, when we're parsing something with no sequence recovery (say an expression) and we don't match the grammar, no splitting rule applies etc... are we going to create an opaque node? Does this function do it or the caller?

I suspect it would be a better fit to:
a) have an Opaque node, if not in this patch then very soon
b) operate on an ArrayRef<Token> or ArrayRef<ForestNode>, and stop relying on eof to delimit (so we can parse subranges if needed)
c) stop relying on Accept actions, since we'll want to look at the stack when we run out of tokens in general, and Accept just models a trivial case of that

hokein updated this revision to Diff 413788.Mar 8 2022, 6:32 AM

fix a subtle bug where we might leave an unremoved node in the reduce path.

sammccall added inline comments.Mar 8 2022, 8:02 AM
clang/include/clang/Tooling/Syntax/Pseudo/Forest.h
30 ↗(On Diff #413609)

alignas(ForestNode*)

67 ↗(On Diff #413609)

dump(), dumpRecursive()

78 ↗(On Diff #413609)

Rule, Elements

90 ↗(On Diff #413609)

remove trailing _

91 ↗(On Diff #413609)

ForestNode*

100 ↗(On Diff #413609)

count()?

clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h
40 ↗(On Diff #413609)

the fact that this uses a DAG seems less important than what it represents. I think the next sentence is a better introduction, and then we should describe what data the stacks are modelling, finally we should describe the data structure after (see comment below).

43 ↗(On Diff #413609)

as with forest, I think this places too much emphasis on the memory-compactness when it's not the main benefit. (If RAM were free we still wouldn't want a big array of stacks).

43 ↗(On Diff #413609)

To address the two above comments, maybe something like.

A Graph-Structured Stack represents the multiple parse stacks for a generalized LR parser.

Each node stores a parse state, the last parsed ForestNode, and its parent(s).
There may be several heads (top of stack), and the parser operates by:
 - shift: pushing terminal symbols on top of the stack
 - reduce: replace N symbols on top of the stack with one nonterminal

The structure is a DAG rather than a stack:
- GLR allows multiple actions on the same head, producing forks (nodes with the same parent).
- The parser reconciles nodes with the same (state, ForestNode), producing joins (node with multiple parents).

The parser is responsible for creating nodes, and keeping track of the set of heads.
46 ↗(On Diff #413609)

combing -> combining

I wonder if combining is the right explanation here - unlike local ambiguity packing, it's not like we produce two things and then merge them.

What about rather: sharing stack prefixes: when the parser must take two conflicting paths, it creates two new stack head nodes with the same parent.

48 ↗(On Diff #413609)

combing -> combining
euqal -> equal

48 ↗(On Diff #413609)

the "-- as..." clause is a bit confusing and doesn't seem necessary.

I think what's missing here is a high level explanation of what's going on in the parse to trigger this, and explicitly mentioning this is how we get a node with multiple parents.

When alternative parses converge on the same interpretation of later tokens, their stack heads end up in the same state. These are merged, resulting in a single head with several parents.

53 ↗(On Diff #413609)

this shows forking but not merging
and in particular it suggests that #heads == #stacks, which is not the case

54 ↗(On Diff #413609)

arrows are pointing the wrong way (at least opposite to our pointers)

58 ↗(On Diff #413609)

the name "Graph" is IMO too general, this isn't a reusable graph or even dag class.

I think GSS is fine, it's a distinctive name from the literature, and the expansion of the abbreviation is nice and descriptive (but too verbose to be the actual name)

59 ↗(On Diff #413609)

this comment doesn't say anything, remove?

60 ↗(On Diff #413609)

alignas(Node*)

(in practice this will match ForestNode* so is a no-op, but it documents the intent)

61 ↗(On Diff #413609)

again, drop comment unless there's something to say

64 ↗(On Diff #413609)

This is not what predecessor usually means in graph theory: u is the predecessor of v if there is *some path* from u->v.

I think "parent" is a common and well-understood term.

68 ↗(On Diff #413609)

Also not sure what this comment says:

  • first line just repeats the type: type is a forest node, forest nodes are always for symbols, and terminal/nonterminal are all the possibilities
  • second line is either referring to or defining (not sure) edge labels, but edge labels aren't defined or referred to anywhere else, so why?

Maybe:

The parse tree for the last symbol we parsed.
This symbol appears on the left of the dot in the parse state.
77 ↗(On Diff #413609)

why is Parsed not part of the identity?
(maybe there's a good reason, but there should be a comment)

87 ↗(On Diff #413609)

8 parents isn't that many and it seems like this might be a dynamic property of the input code rather than a static property of the grammar.

But I don't think this bitpacking is buying anything, it looks like the layout is:

  • State : 13
  • PredecessorCount : 3
  • (padding) : 48
  • Parsed : 64

So I think we might as well just use a uint16 PredecessorCount and still have room left over for a uint16 refcount later. Is there anything else we might usefully store in the extra space?

100 ↗(On Diff #413609)

name consistently with forest arena

112 ↗(On Diff #413609)

as discussed offline, the signature here should involve a token range, or likely an ArrayRef<ForestNode> for the relevant terminals.

Probably a FIXME to allow the start symbol to be specified.

114 ↗(On Diff #413609)

After offline discussion I think we either want:

  • to hide GSS as mentioned above, and just test the forest output
  • expose GSS class, and have a two versions of the parse function: one that finalizes the GSS into a result ForestNode, and a "raw" one that just returns the GSS. Then we can get at the GSS state at a point by running the raw parser on some prefix of the code. (Or even *only* have the raw one and make it the caller's responsibility to extract the node from the GSS, but this is probably silly)
129 ↗(On Diff #413609)

I love the name frontier!
Unfortunately the word refers to the whole boundary/set, rather than individual elements of it.

Maybe this?

/// The list of actions we're ready to perform.
struct {
  std::vector<Pending> Shift;
  std::vector<Pending> Reduce;
  std::vector<Pending> Accept;
  bool empty(); // accept is an odd-one-out here, but I think it's going away?
} Frontier;

Also maybe a FIXME that the frontier needs to order reductions so that local ambiguity packing is guaranteed to work as a single pass

133 ↗(On Diff #413609)

just Action?

clang/lib/Tooling/Syntax/Pseudo/Forest.cpp
30 ↗(On Diff #413609)

unhandled

clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp
33 ↗(On Diff #413609)

I'd suggest calling this NextTok instead of Lookahead:

  • in common english this is a verb rather than a noun
  • in parsers it more often refers to the *number* of tokens in the tables than the tokens themselves
  • referring to the token you're *currently shifting* as "ahead" is bizarre!
67 ↗(On Diff #413609)

if this is a hot loop we shouldn't be creating std::vectors and throwing them away

80 ↗(On Diff #413609)

we shift *tokens*, not states: the token conceptually moves from the input to the stack. (This seems to be other references not just me!)

if multiple stack heads will reach the same state after shifting a token?

85 ↗(On Diff #413609)

I think the arrows directions aren't particularly clear (our pointers go the opposite way) or necessary here, and backticks and up arrows are hard to read.

A couple of (widely-supported) box-drawing characters would help a lot I think:

0--1--2
   └--3

0--1--2--4
   └--3--┘

(I would keep using - and | rather than making everything pretty boxes, it's readable enough and easier to edit. We may want to consider box-drawing characters for dump functions though...)

93 ↗(On Diff #413609)

I can't parse a "perform" shift.

Batch shifts by target state so we can merge matching groups?

97 ↗(On Diff #413609)

I don't think tiebreaking by Head achieves anything.

If we really need to ensure deterministic behavior, comparing pointers won't do that as allocation may vary.
On the other hand, stable_sort should work: we'll tiebreak by order enqueued, so we're deterministic if our inputs are.

108 ↗(On Diff #413609)

SmallVector

110 ↗(On Diff #413609)

llvm::is_contained

114 ↗(On Diff #413609)

if this turns out to be hot, you could avoid the temporary array of predecessors: return a mutable node from addNode

128 ↗(On Diff #413609)

name doesn't match return type

131 ↗(On Diff #413609)

I suspect S{0} will be verbose enough, it's not likely spelling out state will make this much less cryptic, but it may be harder to scan

141 ↗(On Diff #413609)

We're leaning on std::function a lot for code that's supposedly in the hot path.
It's probably fine, but I'd like to see this written more directly.

As far as I can tell, this could easily be a class, with enumerateReducePath a (recursive) method that concretely calls into handleReducePath() or something at the bottom.

Also if it's a class we can easily reset() instead of destroying it if we want to reuse its vectors across separate reductions.

158 ↗(On Diff #413609)

(just a note, I need to dig into this part tomorrow!)

clang/tools/clang-pseudo/ClangPseudo.cpp
33 ↗(On Diff #413609)

nit: a C++ source file => a source file

33 ↗(On Diff #413609)

Why is this a new flag independent of Source?

It also seems inconsistent with other flags, which describe what output is requested rather than what computation should be done.

Maybe "-print-forest" and "-print-stats"? (which could also potentially control stats of other stages)

hokein marked 13 inline comments as done.Mar 21 2022, 7:43 AM

splitting the forest data structure to https://reviews.llvm.org/D122139, comments around Forest.h/.cpp should be addressed.

clang/include/clang/Tooling/Syntax/Pseudo/Forest.h
57 ↗(On Diff #413609)

renamed to startTokenIndex.

hokein updated this revision to Diff 426488.May 2 2022, 12:35 PM
hokein marked an inline comment as done.

Updates:

  • a derived version of D122408 and D121368;
  • refine the APIs, getting rid of the GLR parser, and providing fine-grained pieces to allow writing tests easier;
  • add unittests for the algorithm and a simple smoke lit test;
  • when we fail to parse the input, we return an opaque forest node rather than a nullptr;
  • rebase to the main branch;
Herald added a project: Restricted Project. · View Herald TranscriptMay 2 2022, 12:35 PM
hokein updated this revision to Diff 426493.May 2 2022, 12:47 PM

Fix the bad format from lint.

sammccall accepted this revision.May 3 2022, 4:02 AM

This looks really good, great job on the signatures & tests.
There are a few different ways to formulate the signatures of glrParse/glrReduce, and some possible optimizations, but I can't see anything that's an obvious improvement worth holding up over.

clang-tools-extra/pseudo/include/clang-pseudo/GLR.h
72

nit: *pointers* are stored as trailing objects, not the parents themselves

144

nit: this is always used synchronously, so llvm::function_ref?

147

the comment says newly-created heads are passed, but actually their inputs are passed and the callback is responsible for creating them.

(Why not have glrShift create the node and pass it to the callback? Or maybe even pass the output vector of NewHeads& in for concreteness?)

148

Maybe also mention the interaction with PendingShift?

"When this function returns, PendingShift is empty."?

152

nit: semantics of the callback aren't obvious, so I think "NewHead" is a better name than "CB"

154

When this function returns, PendingReduce is empty. Calls to NewHeadCB may add elements to PendingReduce

clang-tools-extra/pseudo/tool/ClangPseudo.cpp
39

nit: just "print statistics"? I think this should be orthogonal to other options

clang-tools-extra/pseudo/unittests/GLRTest.cpp
35

this looks so much like a GSS node: why not just use a GSS node?

138

bind is ugly :-(

maybe just have GLRTest::captureNewHeads() return a std::function with the right signature?

This revision is now accepted and ready to land.May 3 2022, 4:02 AM
hokein updated this revision to Diff 426679.May 3 2022, 6:26 AM
hokein marked 6 inline comments as done.

address remaining comments:

  • return GSS::node in the NewHeadCallback, rather the fields of GSS::Node;
  • remove the unncessary NewHeadResult structure in unittest;
hokein retitled this revision from [pseudo][WIP] Implement a GLR parser. to [pseudo] Implement the GLR parsing algorithm. .May 3 2022, 6:27 AM
hokein edited the summary of this revision. (Show Details)
hokein added inline comments.May 3 2022, 6:32 AM
clang-tools-extra/pseudo/include/clang-pseudo/GLR.h
144

we have a captureNewHeads method which returns this callback in unittest, returning llvm::function_ref is not safe -- we could make it return a std::function, but it would be nice to have a unified signature.

clang-tools-extra/pseudo/unittests/GLRTest.cpp
35

oh, right. I added this because I didn't expose the GSS in the Params in my previous version, I needed to store the Parents. Right now it is not needed, we can use the GSS::Node directly.

hokein updated this revision to Diff 426680.May 3 2022, 6:39 AM
hokein retitled this revision from [pseudo] Implement the GLR parsing algorithm. to [pseudo] Implement the GLR parsing algorithm..

fix a dangling reference of the source text in clang-pseudo.

This revision was landed with ongoing or failed builds.May 3 2022, 6:43 AM
This revision was automatically updated to reflect the committed changes.