This is an archive of the discontinued LLVM Phabricator instance.

Differential D130523

[pseudo] Perform unconstrained recovery prior to completion.
ClosedPublic

Authored by sammccall on Jul 25 2022, 3:00 PM.

Details

Reviewers

hokein

Commits

rG2cc7463c85c0: [pseudo] Perform unconstrained reduction prior to recovery.

Summary

Our GLR parser uses lookahead: it only performs reductions that might be consumed
by the shift immediately following. However, when the shift fails and the reduce
is followed by recovery instead, this restriction is incorrect and leads to missing heads.

In turn this means certain recovery strategies can't be made to work. e.g.

ns := NAMESPACE { namespace-body } [recover=Skip]
ns-body := namespace_opt

When `namespace { namespace {` is parsed, we can recover the inner `ns` (using
the `Skip` strategy to ignore the missing `}`). However, this namespace will
not be reduced to a `namespace-body` as EOF is not in the follow-set, and so we
are unable to recover the outer `ns`.

This patch fixes this by tracking which heads were produced by constrained
reduce, and discarding and rebuilding them before performing recovery.

This is a prerequisite for the Skip strategy mentioned above, though there are
some other limitations we need to address too.
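
The discard-and-rebuild step described in the summary can be sketched on a toy model. This is only an illustration under stated assumptions: `ToyGrammar`, `reduce`, `reReduceForRecovery`, `sentenceGrammar` and the `kAnyToken` sentinel are hypothetical stand-ins, heads are plain symbol names rather than GSS nodes, and only unit rules are modelled. The grammar is the one from the patch's unit test (`word := IDENTIFIER`, `sentence := word word`), with the follow set of `word` written out by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy stand-ins (hypothetical; not the clang-pseudo types).
using Symbol = std::string;
const Symbol kAnyToken = "<unknown>"; // sentinel: "can follow anything"

struct ToyGrammar {
  std::multimap<Symbol, Symbol> UnitRules;       // RHS symbol -> LHS nonterminal
  std::map<Symbol, std::set<Symbol>> FollowSets; // nonterminal -> terminals

  bool canFollow(const Symbol &Nonterminal, const Symbol &Terminal) const {
    if (Terminal == kAnyToken)
      return true;
    auto It = FollowSets.find(Nonterminal);
    return It != FollowSets.end() && It->second.count(Terminal) > 0;
  }
};

// Appends to Heads every nonterminal reachable by unit reductions that the
// lookahead permits. Iterates by index because Heads grows during the loop.
void reduce(const ToyGrammar &G, std::vector<Symbol> &Heads,
            const Symbol &Lookahead) {
  for (std::size_t I = 0; I < Heads.size(); ++I) {
    const Symbol Head = Heads[I]; // copy: push_back may reallocate
    auto Range = G.UnitRules.equal_range(Head);
    for (auto It = Range.first; It != Range.second; ++It) {
      const Symbol &LHS = It->second;
      bool Seen = false;
      for (const Symbol &H : Heads)
        Seen |= (H == LHS);
      if (!Seen && G.canFollow(LHS, Lookahead))
        Heads.push_back(LHS);
    }
  }
}

// Discards the heads formed by the constrained reduction (everything from
// HeadsPartition onwards) and rebuilds them with the sentinel lookahead.
void reReduceForRecovery(const ToyGrammar &G, std::vector<Symbol> &Heads,
                         std::size_t HeadsPartition) {
  assert(HeadsPartition <= Heads.size());
  Heads.resize(HeadsPartition);
  reduce(G, Heads, kAnyToken);
}

// Grammar from the unit test: word := IDENTIFIER, sentence := word word.
ToyGrammar sentenceGrammar() {
  ToyGrammar G;
  G.UnitRules.emplace("IDENTIFIER", "word");
  G.FollowSets["word"] = {"IDENTIFIER", "EOF"};
  return G;
}
```

In this model, reducing `{IDENTIFIER}` with lookahead `!` adds nothing because `!` is not in the follow set of `word`, while the recovery path recreates the missing `word` head.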

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

sammccall created this revision. · Jul 25 2022, 3:00 PM

Herald added a project: Restricted Project. · Jul 25 2022, 3:00 PM

sammccall requested review of this revision. · Jul 25 2022, 3:00 PM

Herald added a project: Restricted Project. · Jul 25 2022, 3:00 PM

Herald added subscribers: cfe-commits, alextsao1999.

restore removed debug

Apart from this patch, the other main things we need in order to allow missing brackets to be inferred:

- allow recovery to trigger subsequent recovery, even at EOF. (Simplest way is to address the FIXME at 660, it's pretty involved)
- allow opaque nodes to represent terminals

I have a prototype of all this working together locally, and it seems to work...

Harbormaster completed remote builds in B177480: Diff 447487. · Jul 25 2022, 3:58 PM

sammccall mentioned this in D130551: [pseudo] Allow opaque nodes to represent terminals. · Jul 26 2022, 3:37 AM

hokein accepted this revision. · Jul 28 2022, 7:01 AM

hokein added inline comments.

clang-tools-extra/pseudo/include/clang-pseudo/grammar/LRTable.h
Line 109: This `if` doesn't make sense and, I think, is not needed.
clang-tools-extra/pseudo/lib/GLR.cpp
Line 625: I think we can move the `Heads.resize(HeadsPartition)` on line 634 before the `glrShift()`: as we only shift on the newly-created heads, we might gain some performance back.
Line 653: Can we have some comments on `GLRReduce::operator()` explaining how the parameter `Head` gets modified (new heads are appended to it)?

This revision is now accepted and ready to land. · Jul 28 2022, 7:01 AM

hokein added inline comments. · Jul 28 2022, 7:11 AM

clang-tools-extra/pseudo/lib/GLR.cpp
Line 625: Oops, my previous comment is incorrect: here we want the second part of the partition, while on recovery we want the first part. We can pass `llvm::ArrayRef<const GSS::Node *>(Heads).drop_front(HeadsPartition)` as the Heads to `glrShift`.

sammccall marked 2 inline comments as done. · Aug 19 2022, 6:14 AM

sammccall added inline comments.

clang-tools-extra/pseudo/lib/GLR.cpp
Line 625: As discussed offline, shifting onto a head that was produced by shift should be allowed. Given the grammar

foo := [ ]
bar := [
baz := bar ]

and input `[]`, after `[` we have `Heads = { [, bar }`, with the former shifted and the latter reduced. If we applied your suggestion here, we would fail to parse `foo` (but would succeed in parsing `baz`).
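
The difference between shifting onto all heads and shifting only onto the reduced part of the partition can be sketched with a toy shift table for this grammar. Everything here is a hypothetical stand-in: heads are plain symbol names rather than GSS nodes, and `kShiftTable` and `shiftFrom` are invented for illustration. Passing `First = HeadsPartition` plays the role of the suggested `drop_front(HeadsPartition)`.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Symbol = std::string;

// Hypothetical shift table for the grammar
//   foo := [ ]      bar := [      baz := bar ]
// keyed by (head symbol, next token); the value names the completed item.
const std::map<std::pair<Symbol, Symbol>, Symbol> kShiftTable = {
    {{"[", "]"}, "foo := [ ] ."},
    {{"bar", "]"}, "baz := bar ] ."},
};

// Shifts Token onto every head in Heads[First, end) and returns the new heads.
// First = 0 shifts onto all heads; First = HeadsPartition shifts only onto
// the heads that were formed by reduction.
std::vector<Symbol> shiftFrom(const std::vector<Symbol> &Heads,
                              std::size_t First, const Symbol &Token) {
  assert(First <= Heads.size());
  std::vector<Symbol> NextHeads;
  for (std::size_t I = First; I < Heads.size(); ++I) {
    auto It = kShiftTable.find({Heads[I], Token});
    if (It != kShiftTable.end())
      NextHeads.push_back(It->second);
  }
  return NextHeads;
}
```

With `Heads = {"[", "bar"}` and `HeadsPartition = 1`, shifting `]` from index 0 reaches both the `foo` and `baz` items, while shifting from index 1 reaches only `baz`, which is the failure described in the comment.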

Closed by commit rG2cc7463c85c0: [pseudo] Perform unconstrained reduction prior to recovery. (authored by sammccall). · Aug 19 2022, 6:14 AM

This revision was automatically updated to reflect the committed changes.

sammccall added a commit: rG2cc7463c85c0: [pseudo] Perform unconstrained reduction prior to recovery.

Revision Contents

Path                                                              Size
clang-tools-extra/pseudo/include/clang-pseudo/grammar/LRTable.h   6 lines
clang-tools-extra/pseudo/lib/GLR.cpp                              18 lines
clang-tools-extra/pseudo/unittests/GLRTest.cpp                    26 lines

Diff 453972

clang-tools-extra/pseudo/include/clang-pseudo/grammar/LRTable.h

Context: llvm::ArrayRef<RuleID> getReduceRules(StateID State) const {

     assert(State + 1u < ReduceOffset.size());
     return llvm::makeArrayRef(Reduces.data() + ReduceOffset[State],
                               Reduces.data() + ReduceOffset[State+1]);
   }
   // Returns whether Terminal can follow Nonterminal in a valid source file.
   bool canFollow(SymbolID Nonterminal, SymbolID Terminal) const {
     assert(isToken(Terminal));
     assert(isNonterminal(Nonterminal));
-    return FollowSets.test(tok::NUM_TOKENS * Nonterminal +
+    // tok::unknown is a sentinel value used in recovery: can follow anything.
+    if (tok::unknown)
         [inline] hokein (Done): this `if` doesn't make sense, and is not needed, I think.
+      return true;
+    return Terminal == tokenSymbol(tok::unknown) ||
+           FollowSets.test(tok::NUM_TOKENS * Nonterminal +
                            symbolToToken(Terminal));
   }

   // Looks up available recovery actions if we stopped parsing in this state.
   llvm::ArrayRef<Recovery> getRecovery(StateID State) const {
     return llvm::makeArrayRef(Recoveries.data() + RecoveryOffset[State],
                               Recoveries.data() + RecoveryOffset[State + 1]);
   }
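
The inline comment on the `if` can be illustrated with a small stand-alone sketch. It assumes, as a labelled assumption, that `tok::unknown` is the first enumerator of its enum and therefore has the value 0; the `toytok` namespace and both helper functions are hypothetical. Whatever the enumerator's value, a condition that tests the enumerator itself is a constant and never inspects the argument.

```cpp
#include <cassert>

// Hypothetical miniature of a token-kind enum whose first enumerator is
// `unknown`, mirroring (by assumption) how tok::unknown is declared.
namespace toytok {
enum TokenKind : unsigned short { unknown, identifier, eof, NUM_TOKENS };
}

static_assert(toytok::unknown == 0, "the sketch assumes unknown is zero");

// Questioned form: tests the enumerator itself, which is the constant 0 here.
bool isSentinelConstantCondition(toytok::TokenKind /*Terminal*/) {
  if (toytok::unknown) // always false: the argument is never inspected
    return true;
  return false;
}

// Comparison form: checks the argument against the sentinel.
bool isSentinel(toytok::TokenKind Terminal) {
  return Terminal == toytok::unknown;
}
```

Under that assumption the first helper returns false for every input, so only the explicit comparison, like `Terminal == tokenSymbol(tok::unknown)` in the diff, does any work.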

clang-tools-extra/pseudo/lib/GLR.cpp

Context: const ForestNode &glrParse(const ParseParams &Params, SymbolID StartSymbol,

   llvm::ArrayRef<ForestNode> Terminals = Params.Forest.createTerminals(Params.Code);
   auto &GSS = Params.GSStack;

   StateID StartState = Lang.Table.getStartState(StartSymbol);
   // Heads correspond to the parse of tokens [0, I), NextHeads to [0, I+1).
   std::vector<const GSS::Node *> Heads = {GSS.addNode(/*State=*/StartState,
                                                       /*ForestNode=*/nullptr,
                                                       {})};
+  // Invariant: Heads is partitioned by source: {shifted | reduced}.
+  // HeadsPartition is the index of the first head formed by reduction.
+  // We use this to discard and recreate the reduced heads during recovery.
+  unsigned HeadsPartition = 0;
   std::vector<const GSS::Node *> NextHeads;
   auto MaybeGC = [&, Roots(std::vector<const GSS::Node *>{}), I(0u)]() mutable {
     assert(NextHeads.empty() && "Running GC at the wrong time!");
     if (++I != 20) // Run periodically to balance CPU and memory usage.
       return;
     I = 0;

     // We need to copy the list: Roots is consumed by the GC.
     Roots = Heads;
     GSS.gc(std::move(Roots));
   };
   // Each iteration fully processes a single token.
   for (unsigned I = 0; I < Terminals.size();) {
     LLVM_DEBUG(llvm::dbgs() << llvm::formatv(
                    "Next token {0} (id={1})\n",
                    Lang.G.symbolName(Terminals[I].symbol()), Terminals[I].symbol()));
     // Consume the token.
     glrShift(Heads, Terminals[I], Params, Lang, NextHeads);
       [inline] hokein (Not Done): I think we can move the `Heads.resize(HeadsPartition)` on line 634 before the `glrShift()`: as we only shift on the newly-created heads, we might gain some performance back.
       [inline] hokein (Not Done): oops, my previous comment is incorrect, here we want the second part of the partition; while on recovery, we want the first part of partition. we can pass `llvm::ArrayRef<const GSS::Node *>(Heads).drop_front(HeadsPartition);` as the Heads to `glrShift`.
       [inline] sammccall, author (Done): As discussed offline, shifting onto a head that was produced by shift should be allowed. Given grammar `foo := [ ]`, `bar := [`, `baz := bar ]` and input `[]`, after `[` we have `Heads = { [, bar }` with the former shifted and the latter reduced. If we applied your suggestion here, we would fail to parse `foo` (but would succeed in parsing `baz`).

     // If we weren't able to consume the token, try to skip over some tokens
     // so we can keep parsing.
     if (NextHeads.empty()) {
-      // FIXME: Heads may not be fully reduced, because our reductions were
-      // constrained by lookahead (but lookahead is meaningless to recovery).
+      // The reduction in the previous round was constrained by lookahead.
+      // On valid code this only rejects dead ends, but on broken code we should
+      // consider all possibilities.
+      //
+      // We discard all heads formed by reduction, and recreate them without
+      // this constraint. This may duplicate some nodes, but it's rare.
+      LLVM_DEBUG(llvm::dbgs() << "Shift failed, will attempt recovery. "
+                                 "Re-reducing without lookahead.");
+      Heads.resize(HeadsPartition);
+      Reduce(Heads, /*allow all reductions*/ tokenSymbol(tok::unknown));
+
       glrRecover(Heads, I, Params, Lang, NextHeads);
       if (NextHeads.empty())
         // FIXME: Ensure the `_ := start-symbol` rules have a fallback
         // error-recovery strategy attached. Then this condition can't happen.
         return Params.Forest.createOpaque(StartSymbol, /*Token::Index=*/0);
     } else
       ++I;

     // Form nonterminals containing the token we just consumed.
     SymbolID Lookahead =
         I == Terminals.size() ? tokenSymbol(tok::eof) : Terminals[I].symbol();
+    HeadsPartition = NextHeads.size();
     Reduce(NextHeads, Lookahead);
       [inline] hokein (Done): Can we have some comments on `GLRReduce::operator()` on how does parameter `Head` get modified (new heads are appended to it)?
     // Prepare for the next token.
     std::swap(Heads, NextHeads);
     NextHeads.clear();
     MaybeGC();
   }
   LLVM_DEBUG(llvm::dbgs() << llvm::formatv("Reached eof\n"));

   // The parse was successful if we're in state `_ := start-symbol .`

clang-tools-extra/pseudo/unittests/GLRTest.cpp

Context: TEST_F(GLRTest, RecoverTerminal) {

   const ForestNode &Parsed =
       glrParse({Tokens, Arena, GSStack}, id("stmt"), TestLang);
   EXPECT_EQ(Parsed.dumpRecursive(TestLang.G),
             "[ 0, end) stmt := IDENTIFIER ; [recover=Skip]\n"
             "[ 0, 1) ├─IDENTIFIER := tok[0]\n"
             "[ 1, end) └─; := <opaque>\n");
 }

+TEST_F(GLRTest, RecoverUnrestrictedReduce) {
+  // Here, ! is not in any rule and therefore not in the follow set of `word`.
+  // We would not normally reduce `word := IDENTIFIER`, but do so for recovery.
+
+  build(R"bnf(
+    _ := sentence
+
+    word := IDENTIFIER
+    sentence := word word [recover=AcceptAnyTokenInstead]
+  )bnf");
+
+  clang::LangOptions LOptions;
+  const TokenStream &Tokens = cook(lex("id !", LOptions), LOptions);
+  TestLang.Table = LRTable::buildSLR(TestLang.G);
+  TestLang.RecoveryStrategies.try_emplace(
+      extensionID("AcceptAnyTokenInstead"),
+      [](Token::Index Start, const TokenStream &Stream) { return Start + 1; });
+
+  const ForestNode &Parsed =
+      glrParse({Tokens, Arena, GSStack}, id("sentence"), TestLang);
+  EXPECT_EQ(Parsed.dumpRecursive(TestLang.G),
+            "[ 0, end) sentence := word word [recover=AcceptAnyTokenInstead]\n"
+            "[ 0, 1) ├─word := IDENTIFIER\n"
+            "[ 0, 1) │ └─IDENTIFIER := tok[0]\n"
+            "[ 1, end) └─word := <opaque>\n");
+}
+
 TEST_F(GLRTest, NoExplicitAccept) {
   build(R"bnf(
     _ := test

     test := IDENTIFIER test
     test := IDENTIFIER
   )bnf");