This is an archive of the discontinued LLVM Phabricator instance.

[pseudo] Add error-recovery framework & brace-based recovery
ClosedPublic

Authored by sammccall on Jun 23 2022, 7:31 PM.

Details

Summary

The idea is:

  • a parse failure is detected when all heads die when trying to shift the next token
  • we can recover by choosing a nonterminal we're partway through parsing, and determining where it ends through nonlocal means (e.g. matching brackets)
  • we can find candidates by walking up the stack from the (ex-)heads
  • the token range is defined using heuristics attached to grammar rules
  • the unparsed region is represented in the forest by an Opaque node

This patch has the core GLR functionality.
It does not allow recovery heuristics to be attached as extensions to
the grammar, but rather infers a brace-based heuristic.

Expected followups:

  • make recovery heuristics grammar extensions (depends on D127448)
  • add recover to our grammar for bracketed constructs and sequence nodes
  • change the structure of our augmented _ := start rules to eliminate some special-cases in glrParse.
  • (if I can work out how): avoid some spurious recovery cases described in comments
  • grammar changes to eliminate the hard distinction between init-list and designated-init-list shown in the recovery-init-list.cpp testcase

Diff Detail

Event Timeline

sammccall created this revision.Jun 23 2022, 7:31 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 23 2022, 7:31 PM
Herald added a subscriber: mgrang. · View Herald Transcript
sammccall updated this revision to Diff 440376.Jun 27 2022, 1:53 PM
sammccall retitled this revision from xxx recover to [pseudo] Add error-recovery framework & brace-based recovery.Jun 27 2022, 1:53 PM
sammccall edited the summary of this revision. (Show Details)

add tests, clean up

sammccall updated this revision to Diff 440379.Jun 27 2022, 1:59 PM

reduce after final recovery
comments

sammccall published this revision for review.Jun 27 2022, 2:04 PM
sammccall added a reviewer: hokein.
Herald added a project: Restricted Project. · View Herald TranscriptJun 27 2022, 2:04 PM
sammccall updated this revision to Diff 440383.Jun 27 2022, 2:04 PM

revert format changes

sammccall updated this revision to Diff 440393.Jun 27 2022, 2:13 PM
sammccall edited the summary of this revision. (Show Details)

update testcase

This revision was not accepted when it landed; it landed in state Needs Review.Jun 28 2022, 12:08 PM
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.
sammccall reopened this revision.Jun 28 2022, 12:10 PM

Sorry, I committed this by mistake when working on another change.
Reverted and this is ready for review.

the patch looks a good start to me, some initial comments (mostly around the recovery part).

clang-tools-extra/pseudo/include/clang-pseudo/GLR.h
149

If I read it correctly, consuming zero token is consider failure of the function, right?

clang-tools-extra/pseudo/include/clang-pseudo/grammar/Grammar.h
135

So each rule will have at most 1 recovery-strategy (it is fine for the initial version), but I think in the future we want more (probably we need to change the Sequence to an array of {SymbolID, RevoeryStrategy}).

selection-statement := IF CONSTEXPR_opt ( init-statement_opt condition ) statement ELSE statement

We might want different recoveries in ( . init-statement_opt condition ) (init-statement_opt condition) . statement, ELSE . statement.

clang-tools-extra/pseudo/include/clang-pseudo/grammar/LRTable.h
269

I see the motivation of the OffsetTable structure now, this would come as a follow-up to simplify the ReduceOffset and RecoveryOffset, right?

clang-tools-extra/pseudo/lib/GLR.cpp
30

nit: unsigned => Token::Index

45

The GLR.cpp file is growing significantly, I think the recovery part is large enough to be lived in a separate file GLRRecovery.cpp, but the declaration can still be in the GLR.h.

65

this is not implemented, right? Add a FIXME?

69

nit: maybe name it Parses or PartialParses. Path make me think this is a patch of GSS nodes.

84

I think you're right -- I thought the first GSS node with a recovery state we encounter during the Walkup state is the node we want.

This example is a little confusing (it still matches our previous mental model), I didn't get it until we discussed offline. I think the following example is clearer

parsing the text `if (true) else ?`

IfStmt := if (stmt) else . stmt - which we're currently parsing
IfStmt := if (.stmt) else stmt  - (left) the first recovery GSS node, should not recover from this
IfStmt := . if (stmt) else stmt - (up), we should recover from this

I also think it is worth to add it into the test.

89

I can't think of a better solution other than a search-based one (would like to think more about it).

Currently, we find all recovery options by traversing the whole GSS, and do a post-filter (based on the Start, and End). I might do both in the DFS (which will save some cost of traversing), the DFS will give us the best recovery options, and then we build the GSS node, and forest node. But up to you.

92

nit: Walkup seems a bit clearer than DFS.

112

any particular reason why we iterate the OldHeads in a reversed order?

131

nit: might be clearer to move the assertion to the beginning of the function.

134

further right should be further left, right?

150

If we find a better recovery option, we're throwing all the newly-built heads and forest nodes, this seems wasteful. I think we can avoid it by first finding the best solutions and creating new heads and gss nodes for them.

151

The Line135-Line150 code looks like a good candidate for an evaluateOption function.

175

Advancing the TokenIndex here seems a little surprising and doesn't match what the comment says ( On failure, NewHeads is empty and TokenIndex is unchanged.). I think this should be done in caller side.

624

currently, it is fine for bracket-based recovery. In the future, we will need to run a force reduction for Heads regardless of the lookahead token before running the glrRecover, add a FIXME?

625

If the glrRecover returns a bool indicating whether we're able to recover, then the code is simpler:

if (NextHeads.empty() && !glrRecover(...))
  return Params.Forest.createOpaque(...);
clang-tools-extra/pseudo/unittests/GLRTest.cpp
385

nit: for GSS nodes, I will name them like GSSNode<StateID>, it is less describable, but it can be easily distinguished from other forest nodes, and match the GSS described in the comment, which I think it is clearer.

428

I suppose there is an extra ? in the string literal, the index of the brackets (4, 5) used below doesn't match the string literal here.

sammccall marked 14 inline comments as done.

address comments

sammccall updated this revision to Diff 441154.Jun 29 2022, 1:41 PM

address comments

clang-tools-extra/pseudo/include/clang-pseudo/GLR.h
149

Consuming zero tokens + producing heads is success, consuming zero tokens + not producing heads is failure.
Tweaked this comment a bit.

(Also fixed a bit of logic that ignored recovery that didn't advance TokenIndex!)

clang-tools-extra/pseudo/include/clang-pseudo/grammar/Grammar.h
135

Added a comment to call out this limitation.
There are several ways to deal with this, e.g. we could split up the grammar into multiple rules, each with one recovery. (But the approach you describe makes sense too)

clang-tools-extra/pseudo/include/clang-pseudo/grammar/LRTable.h
269

Yes. Though I'm on the fence about whether it's worth it with one case (it's a bit awkward to generalize the building IIRC).

clang-tools-extra/pseudo/lib/GLR.cpp
45

This is interesting, recover/shift/reduce/parse are (vertically) self-contained enough that it didn't seem like a big problem yet...

If the concern is file length, maybe we'd thather start with reduce; if it's relatedness, GSS?

My line count is:

  • recover: 156
  • shift: 47
  • reduce: 319
  • parse: 87
  • GSS: 66
69

Renamed to DiscardedParse, does that work for you?

84

Fair enough, a couple of problems with that example though:

  • dropping the "then" clause from the grammar of an if statement is confusing (but adding it back in makes the example complicated)
  • using a non-kernel item for the last fails to show the "up" edge clearly (but there's not enough context to show a kernel item)

Came up with another one that hopefully works.

I didn't manage to write a reasonable test showing it - it basically can't happen with bracket recovery. In order to demonstrate it we need { in the "wrong" recovery construct to be paired with a } after the cursor... which makes it look *very* much like a correct recovery.

Added a FIXME to add a testcase once we can define new recovery strategies instead.

89

I'm not sure what you mean by a search-based one, can you elaborate?

Currently, we find all recovery options by traversing the whole GSS, and do a post-filter (based on the Start, and End). I might do both in the DFS (which will save some cost of traversing)

This replaces a relatively simple traversal of GSS nodes (which there aren't that many of) with running the recovery strategy more times - this is (usually) a more complicated algorithm running over tokens, which is (usually) a larger data set. It seems likely to be a bad trade performance-wise.

In any case, performance isn't terribly critical here, and mixing the discovery & evaluation of options seems harder to read & debug (we lose the nice documented data structures we can debug, predictable -debug dumping of not-taken recovery options, etc).

112

Not that I can remember, and the unit tests don't seem to care, so removed

131

The purpose of the assertion is to make it obvious that NewHeads.clear() only discards items added during the loop below.
An assertion at the top of the function would be less useful for this, both because it's not on the screen when reading NewHeads.clear() and because it's much less locally obvious that nothing has been pushed onto NewHeads at some prior point in the function.

151

What would the signature of such a function look like?
There are many different possible outcomes:

  • replace the set with this option (BestOptions.clear)
  • add this option to the set (BestOptions.push_back)
  • discard this option (continue)
  • discard this option and all future options (break)
  • update recovery range or not

Access to control flow (break/continue) and read/write access to RecoveryRange and BestOptions seem like the natural way to express this, so it's not clear what a function could abstract away.

175

Oops, this shouldn't be done at all - originally I had a different contract for glrRecover().

(In fact, if NewHeads is empty after glrRecover() we never look at TokenIndex again, but let's not rely on that)

624

Done. This impacts the internal structure of the recovered node much more than it does which recoveries are available, so I think we can defer this for a fair while.

625

I had this initially, but:
a) it's redundant, and makes it less locally obvious that we've repopulated NewHeads
b) soon recovery will always succeed, so !NewHeads.empty() becomes an assertion

clang-tools-extra/pseudo/unittests/GLRTest.cpp
385

Yeah, I found those testcases difficult to read :-(
I did work out that the numbers might be state numbers rather than node numbers, but I found it hard to a) remember that, since it's often GSSNode0, GSSNode1 etc and b) remember which state is which.
It doesn't generalize, e.g. the RecoverRightmost case has 3 nodes with the same state.

428

Oops, yes!
Originally this testcase was meant to test more things, and apparently I didn't finish simplifying it...

hokein accepted this revision.Jul 5 2022, 7:08 AM

Looks great, let's ship it! feel free to land it in any form you think it is suitable.

clang-tools-extra/pseudo/include/clang-pseudo/GLR.h
149

nit: remove the duplicated consumes.

clang-tools-extra/pseudo/include/clang-pseudo/grammar/LRGraph.h
148

nit: mention the Result must be a nonterminal.

clang-tools-extra/pseudo/include/clang-pseudo/grammar/LRTable.h
77

nit: unrelated method?

269

A motivating bit is that it is tricky to implement a correct get method (we both made the same out-of-bound issue). I'm fine with the current form, we can revisit it afterwards.

clang-tools-extra/pseudo/lib/GLR.cpp
45

Yeah, indeed these pieces can be split, my main concern is not file length -- I would prefer keeping shift/reduce/parse into a single file, as they form a complete GLR algorithm (splitting them would make it harder to read and follow).

Recovery seems like a different thing, I would image this part of code will grow more in the future

  • the GLR algorithm has the core recovery mechanism framework with some fallback recovery strategies (eof, brackets)
  • we have different recovery strategies implemented (some of them might be cxx-specific, probably be part of pseudoCXX library);
69

It is better than the original name. The DiscardedParse is a bit weird when we start to put it under the opaque node, in that sense, they are not discarded, IMO

84

oops, your example makes more sense -- I didn't notice that I missed the if-body stmt.

89

I'm not sure what you mean by a search-based one, can you elaborate?

The search-based one refers to the current one -- basically we perform a brute-force search for all available recoveries and get the best one.

In any case, performance isn't terribly critical here, and mixing the discovery & evaluation of options seems harder to read & debug (we lose the nice documented data structures we can debug, predictable -debug dumping of not-taken recovery options, etc).

That's fair enough.

131

Yeah, but I'd treat it as a contract of the API (the NewHeads argument passed to glrRecover must be empty).

btw, the empty assertion is missing in the latest version.

179

should we worry about the case where we create a duplicated forest node? e.g. we have two best options and both recover to the same nonterminal.

clang-tools-extra/pseudo/unittests/GLRTest.cpp
490

nit: I'd probably move this to the comment mentioned in glrRecovery(), which is more discoverable.

555

nit: not sure the intention having the RecoveryEndToEnd separated from the above recover-related tests, why not grouping them together?

This revision is now accepted and ready to land.Jul 5 2022, 7:08 AM
sammccall marked 5 inline comments as done.Jul 5 2022, 11:49 AM

Thanks in particular for flagging the issue with duplicate forest nodes, you've found a hole in the model.
That said, I've left a big FIXME and I think we should patch it later.

clang-tools-extra/pseudo/include/clang-pseudo/grammar/LRGraph.h
148

Hmm, I don't think it has to be?

(To deduce a brace recovery rule we require it, but that doesn't mean it needs to be true in general)

clang-tools-extra/pseudo/lib/GLR.cpp
45

It's hard to tell yet, but I guess I don't really see why it's going to be easier to reason about shift/recover in isolation vs shift/reduce. Let's see how this goes past the initial patch (in particular once we move stuff into extensions)

131

Yes, I think this is obsolete - the latest version of glrRecover no longer requires NewHeads to be empty at all, as it doesn't use it as scratch space.

179

Oh no, you're right... we also have all the options for the GSS node being either the same/different in that case.

And it gets worse, what if we have one recovery -> X, and another recovery -> Y, and a rule X := Y. The following glrReduce will produce a duplicate X, instead of an AmbiguousX{ OpaqueX, SequenceX{ OpaqueY } }.

As much as I hate this, I think we should slap a FIXME on it and move on. I don't think multiple tied recoveries is common, and the solution here is just as likely to break the tie as it is to fix this with fancy algorithms.
However I don't think we have enough data to make a decision on exactly what to do yet.

clang-tools-extra/pseudo/unittests/GLRTest.cpp
555

this tests glrParse, and it's grouped with the other glrParse tests consistent with this file (e.g. GLRReduceOrder is with the glrParse tests, not with the glrReduce tests)

This revision was landed with ongoing or failed builds.Jul 5 2022, 11:50 AM
This revision was automatically updated to reflect the committed changes.
sammccall marked an inline comment as done.