This is an archive of the discontinued LLVM Phabricator instance.

[pseudo] Support parsing variant target symbols.
ClosedPublic

Authored by hokein on May 5 2022, 5:29 AM.

Details

Summary

With this patch, we're able to parse smaller chunks of C++ code (statement,
declaration), rather than translation-unit.

The target symbol is listed in the grammar in a form of `_ :=
statement`, each target symbol has a dedicated state (_ := • statement).
We create and track all these separate states in the LRTable. When we
start parsing, we lookup the corresponding state to start the parser.

LR pasing table changes with this patch:

  • number of states: 1467 -> 1471
  • number of actions: 82891 -> 83578
  • size of the table (bytes): 334248 -> 336996

Diff Detail

Event Timeline

hokein created this revision.May 5 2022, 5:29 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 5 2022, 5:29 AM
Herald added a subscriber: mgrang. · View Herald Transcript
hokein requested review of this revision.May 5 2022, 5:29 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 5 2022, 5:29 AM
sammccall accepted this revision.May 9 2022, 12:16 PM

Very nice!
And change to compiled grammar size is small.
Would including *every* nonterminal as a start symbol would blow the size up a bit?
This would eliminate some complexity in the interface.

clang-tools-extra/pseudo/include/clang-pseudo/GLR.h
125

From the caller's perspective, is "target symbol" really a clearer name than "start symbol"? The latter seems like more common terminology.

126

// A rule _ := Target must exist for the chosen target symbol.

clang-tools-extra/pseudo/lib/LRGraph.cpp
163

might be clearer just to make this a member of the builder, and add a addStartState(SymbolID, StateID) method?

clang-tools-extra/pseudo/lib/LRTableBuild.cpp
43

nit: since we're ultimately copying this array to call the constructor, maybe just take ArrayRef here?

clang-tools-extra/pseudo/lib/cxx.bnf
29–31

Maybe "_ lists all the start-symbols which we support parsing".

clang-tools-extra/pseudo/tool/ClangPseudo.cpp
46

-start-symbol might be more common

62

nit: just findNonterminal()? rather than echoing the type

This revision is now accepted and ready to land.May 9 2022, 12:16 PM
hokein updated this revision to Diff 428479.May 10 2022, 1:14 PM
hokein marked 3 inline comments as done.
  • adress comments
  • rename TargetSymbol -> StartSymbol
  • don't use StartSymbol for "_", rename and update a few occurrences
  • move findNonterminal to Grammar.h since it is used in various places
hokein marked 3 inline comments as done.May 10 2022, 1:18 PM
hokein added inline comments.
clang-tools-extra/pseudo/include/clang-pseudo/GLR.h
125

I'd like to call it start symbol as well, it is clearer.

The only downside is that we're diverging from the standard LR literature (_ is the start symbol of the augmented grammar). We need a name for the symbol _, I rename the Grammar::startSymbol to Grammar::underscore (or even _?), please take a look on this patch again.

clang-tools-extra/pseudo/include/clang-pseudo/Grammar.h
192

it is unfortunate that we put this function in the Grammar.h. This could be avoided when we have the generated enum grammar type, but we are not there yet...

hokein updated this revision to Diff 428634.May 11 2022, 5:32 AM

add a comment in cxx.bnf explaining why not making all nonterminals as start symbol by default.

Would including *every* nonterminal as a start symbol would blow the size up a bit?
This would eliminate some complexity in the interface.

This does increase the size, some datapoints:

  • ~240 start symbols
  • number of states: 1.4K -> 2K
  • number of actions: 83K -> 96K
  • size of table(bytes): 334K -> 388K

As discussed, the benefit is not obvious, let's leave it out now and revisit it when needed (added a comment in the cxx.bnf).

sammccall accepted this revision.May 11 2022, 10:40 AM
sammccall added inline comments.
clang-tools-extra/pseudo/include/clang-pseudo/Grammar.h
192

This doesn't seem so bad to me, but I'd prefer it to be a method on Grammar I think - callers like the ones in the benchmark shouldn't have to worry about the GrammarTable object.

clang-tools-extra/pseudo/tool/ClangPseudo.cpp
118

this means if you pass --start-symbol=typo we'll hit an assertion

I think findNonterminal should probably return Optional instead?

hokein updated this revision to Diff 429645.May 16 2022, 1:33 AM

move the findNonterminal to Grammar.

This revision was landed with ongoing or failed builds.May 16 2022, 1:38 AM
This revision was automatically updated to reflect the committed changes.
hokein marked 2 inline comments as done.