This is an archive of the discontinued LLVM Phabricator instance.

[pseudo] Compile cxx grammar.
Needs ReviewPublic

Authored by hokein on May 9 2022, 7:15 AM.

Details

Reviewers
sammccall
Summary

It compiles the cxx bnf grammar, and generates enum-type grammar symbols
and prebuilt LRTable for the pseudo-parser.

Diff Detail

Event Timeline

hokein created this revision.May 9 2022, 7:15 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 9 2022, 7:15 AM
Herald added a subscriber: mgorny. · View Herald Transcript
hokein requested review of this revision.May 9 2022, 7:15 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 9 2022, 7:15 AM
hokein added a comment.May 9 2022, 7:18 AM

This is just a prototype, wanting some early feedback before making further progress:

  • compiling the generated Cxx.cpp is very slow (took minutes, mostly due to the LRTable::Actions);
  • layering (location of the generated header file) is not super clear;

Cool! High level thoughts:

Motivation
What's the win here? performance of loading/compiling the grammar? self-contained-ness? how much are we saving?
Doing the compilation at build time seems likely to be good for performance, but we should have numbers. And there are costs to doing it now vs in a few months, so there should be benefits too.

In terms of development, having everything grammar-related be compiled makes it harder to make changes to the grammar (e.g. constraints on rules, error hints, and whatever else we might want).
One way around this: start with a version where the public interface looks like this, but the impl is just compiling the strings in the background. (Probably a part of the generator that just embeds the lines of the grammar in a string array).Then opt-in the critical and stable bits at compile time, like the LR table build. Maybe eventually we'll have everything fully compiled, but it doesn't seem like the right balance yet.

Build performance
Compile time/ram is my biggest technical concern. We know that there's no good portable way to get blobs of data into the build system, and the parser is the bottleneck. (See the sad std::embed situation). 300KB should be manageable, with some work.
There are things we can do to mitigate:

  • for the big lists, use simple top-level declarations of arrays with types, not nested expressions ending up as std::initializer_lists. Just parsing the elements is going to be slow, but there's no reason to involve overload resolution, force them to all be in memory at once, etc.
  • split each big list into its own file for build-system level parallelism. In particular actions[] is ~half the data. This is easy once it's a top-level array.
  • not parse-time, but related: switch the classes to holding ArrayRefs rather than Vectors so we don't have to copy everything. (vector{1,2,3} is always a copy, it can't share the underlying static because it's mutable).
  • make the element types simpler to parse. Currently the Actions buffer is a vector<class Action> which at the bottom is two numbers packed into 16 bits. This could instead be a uint16_t Actions[]={...};, with LRTable wrapping the uint16_ts into Action objects on demand (at zero runtime cost).
  • if we just have an array of 16-bit integers, it's possible making that source file C would be faster to parse than C++ (fewer language features). I doubt it but...

Layering

  • the generator shouldn't depend on much, to avoid creating long paths in the build graph. It's pretty good now. I may be missing something, but does it need to link against clangBasic? If this is for functions like getTokenName() we could consider reimplementing them. (I think clangPseudoGrammar may be a better name than clangPseudoBasic).
  • similarly, the generated code should avoid dependencies particularly if it's slow to compile - maybe it doesn't need clangBasic either? (For our purposes it'd also be nice to avoid depending on any tablegenned headers).

Scope/generality
I think we're should plan to run this generator in exactly one configuration (i.e. for our C++ grammar), we shouldn't pay the costs of making it efficient to reuse.
This means we don't need to generate the whole library with complete API surface, just the bits that really need to be generated. Generated code is harder to maintain and browse.

  • in the header, this is just the symbol list (and later the rule list). the rest of the header can be hand-written.
  • we'd want a cpp file for each array (maybe group some together), but none of the initialization logic that uses them: we can just forward declare and use those arrays from hand-written C++

Maybe later we want to be able to compile a different grammar for C and we revisit this (but I think equally likely we work out how to express language options in terms of dynamically-disabling grammar rules instead).