This is an archive of the discontinued LLVM Phabricator instance.

[syntax][pseudo] Add grammar facilities for the pseudo-parser
ClosedPublic

Authored by hokein on Nov 30 2021, 5:34 AM.

Details

Summary

This patch introduces basic structures for the grammar, which is used to
build a grammar-based parser.

As the first patch, the scope is limited:

  • base types of modeling the grammar rules, symbols
  • grammar parsing bit (annotations are excluded, and will be added afterwards)

Diff Detail

Event Timeline

hokein created this revision.Nov 30 2021, 5:34 AM
hokein requested review of this revision.Nov 30 2021, 5:34 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 30 2021, 5:34 AM
sammccall added inline comments.Dec 21 2021, 2:58 AM
clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h
53

If we want strong types, we could use enum class SymbolID : uint16_t {};

75

Maybe rather "there are at most 2^12 rules"

84

// Sequence can be at most 2^4 tokens long

111

nit: building block is more idiomatic

111

Nit: it's rather the RuleSpec factories and Grammar::build (which are static) that parses BNF grammar.
The Grammar object itself doesn't have this responsibility.

134

I think this should be explicit rather than implicit, and so not just "exposed for testing" but explicitly called

145

The implementation also (inefficiently) supports looking up terminals. Intended?

155

dump suggests an arbitrary debugging representation, but we rely on this being exactly the symbol name. symbolName(SymbolID)?

Maybe also doc what the names are (for terminals).
e.g. Terminals have names like "," (kw_comma) or "OPERATOR" (kw_operator).

157

in the prototype the "recursively" versions ended up being much less useful than I thought. You may want to drop the complexity

172

Are compiled transition tables etc expected to live here, or is this a purely descriptive version of teh grammar?

clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp
49

I don't think it makes sense to compute and store nullability if our plan is to build a parser that doesn't support nullable symbols.

Rather we would warn in eliminateOptional() or diagnose() if there are nullable rules. (This doesn't require computing the nullability of every rule)

hokein updated this revision to Diff 401175.Jan 19 2022, 4:46 AM
hokein marked 8 inline comments as done.

refine the patch:

  • address comments
  • remove unneeded code
  • move the BNF-grammar parsing code to a dedicate class, rather in Grammar.
hokein added inline comments.Jan 19 2022, 4:50 AM
clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h
53

we could do that, but I'd leave it as-is at the moment (until we figure out more details about static Grammar construction).

111

Yeah, you're right. Move the BNF-parsing bit to a dedicated place.

145

this is a method function only being used for testing, I don't think we need it. Moved to the test.

172

The later, the Grammar class purely handles grammar symbols/rules. Transition table will be lived somewhere else (table parser), and it will be generated from the TransitionTableBuilder which depends on Grammar.

sammccall accepted this revision.Feb 2 2022, 7:47 AM
sammccall added inline comments.
clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h
86

We're being quite fiddly with layout but actually pretty wasteful.

wc says 652 with 1323 total RHS elements on the RHS - average ~2.
So we could aim for 6 bytes per rule = 4KB but we're using 34 bytes per rule = 22KB.

Can't see a nice concrete way to achieve that though.

88

well, we can set MaxElements to 9 instead of 16, that reduces the rule table from 22kB to 13K...

92

Why is this wrapped in a struct?

124

maybe rulesFor(SymbolID)?

having lookupRules/lookupRule seems like a trap given SymbolID and RuleID have very different semantics and are typedefs for each other!

150

This comment could be more explicit about the relationship between the rules and the symbol!

clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp
50

nit: drop braces, consistent with rest of file

clang/lib/Tooling/Syntax/Pseudo/GrammarBNF.cpp
2

I love the split of bnf parsing into its own file.

Given this file only deals with the parsing, I'm not sure so much has to be encapsulated into the GrammarBuilder class.
I'd probably move things like RuleSpec struct, the ParseLine function etc out to file level, making the (lack of) data/state flow more explicit.
Up to you though.

141

parseline can be a method (or function)

156

Seems like it'd be nicer to trim the trailing # comment off the line before calling ParseLine?

188

nit: "Rule '{0}' has nullable RHS"?

237

"Rule '{0}' has empty RHS", G.dumpRule(RID)?

this points us closer to the source of the error

clang/unittests/Tooling/Syntax/Pseudo/GrammarTests.cpp
2

mangled comment

45

include a message here

or better, ADD_FAILURE("No such symbol: ") << Name first

61

remove

This revision is now accepted and ready to land.Feb 2 2022, 7:47 AM
hokein updated this revision to Diff 405550.Feb 3 2022, 2:05 AM
hokein marked 12 inline comments as done.

address comments

hokein added inline comments.Feb 3 2022, 2:06 AM
clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h
86

Yes :(
I'd keep it as-is (and decrease the MaxElements to 9). One idea is to use a flat array as a center storage of all sequences, then Rule class just need an index, and size. But there will be more data (annotations, hooks) in this class, we could figure it out later.

sammccall accepted this revision.Feb 3 2022, 2:16 AM
sammccall added inline comments.
clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h
86

Yes :(
I'd keep it as-is (and decrease the MaxElements to 9).

SGTM

One idea is to use a flat array as a center storage of all sequences, then Rule class just need an index, and size.

This makes APIs awkward though: Rule.seq() needs extra params to build an ArrayRef, unless you want to carry around a pointer to the big array (which gives back most of the size savings)

clang/lib/Tooling/Syntax/Pseudo/GrammarBNF.cpp
141

move this after the take_while, so that # this is a comment is still skipped?

154

ParseLine -> parseLine

This revision was landed with ongoing or failed builds.Feb 3 2022, 2:29 AM
This revision was automatically updated to reflect the committed changes.
thakis added a subscriber: thakis.Feb 3 2022, 7:46 AM

Is there any background information anywhere what this pseudo parser is?

thakis added inline comments.Feb 3 2022, 7:51 AM
clang/lib/Tooling/Syntax/Pseudo/CMakeLists.txt
3

(i think the usual name would be "clangToolingSyntaxPseudo")

thakis added inline comments.Feb 3 2022, 7:53 AM
clang/unittests/Tooling/Syntax/Pseudo/CMakeLists.txt
6
thakis@thakis:~/src/llvm-project$ ls clang/unittests/**/*Test.cpp | wc -l
127
thakis@thakis:~/src/llvm-project$ ls clang/unittests/**/*Tests.cpp | wc -l
1

Sure - I'll add a readme/links in this directory too, just forgot this was the first commit.

hokein added inline comments.Feb 4 2022, 12:59 AM
clang/lib/Tooling/Syntax/Pseudo/CMakeLists.txt
3

(the name is too verbose, and we have a rough plan to lift Syntax directory up to Tooling, no timeline yet. For now, it might be better to follow the existing name pattern, fixed in b94f09524efe2789598eb8c1bf5f44f5b17148d6.

clang/unittests/Tooling/Syntax/Pseudo/CMakeLists.txt
6