This is an archive of the discontinued LLVM Phabricator instance.

Differential D125231

[pseudo] Compile cxx grammar.
Needs Review · Public

Authored by hokein on May 9 2022, 7:15 AM.


Details

Reviewers

sammccall

Summary

It compiles the cxx bnf grammar, and generates enum-type grammar symbols
and prebuilt LRTable for the pseudo-parser.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

hokein created this revision. May 9 2022, 7:15 AM

Herald added a project: Restricted Project. May 9 2022, 7:15 AM

Herald added a subscriber: mgorny.

hokein requested review of this revision. May 9 2022, 7:15 AM

Herald added a project: Restricted Project. May 9 2022, 7:15 AM

Herald added a subscriber: alextsao1999.

Harbormaster completed remote builds in B163480: Diff 428079. May 9 2022, 7:16 AM

This is just a prototype; I'd like some early feedback before making further progress:

- compiling the generated Cxx.cpp is very slow (it took minutes, mostly due to LRTable::Actions);
- layering (the location of the generated header file) is not very clear.

Cool! High level thoughts:

Motivation
What's the win here? performance of loading/compiling the grammar? self-contained-ness? how much are we saving?
Doing the compilation at build time seems likely to be good for performance, but we should have numbers. And there are costs to doing it now vs in a few months, so there should be benefits too.

In terms of development, having everything grammar-related be compiled makes it harder to make changes to the grammar (e.g. constraints on rules, error hints, and whatever else we might want).
One way around this: start with a version where the public interface looks like this, but the implementation just compiles the strings in the background (probably with a part of the generator that just embeds the lines of the grammar in a string array). Then opt in the critical and stable bits at compile time, like the LR table build. Maybe eventually we'll have everything fully compiled, but it doesn't seem like the right balance yet.
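A minimal sketch of that staged approach, under stated assumptions: `GrammarLines`, `ToyGrammar` and `compile` are hypothetical stand-ins (the real types would be `clang::pseudo::Grammar` and `Grammar::parseBNF`), and the rules shown are made up rather than taken from cxx.bnf. The point is only that `getGrammar()` keeps its generated-looking signature while the work still happens at runtime, once, on first use.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for what the generator would emit: one string per grammar line.
// (Hypothetical rules; the real input would be cxx.bnf.)
static const char *const GrammarLines[] = {
    "_ := expr",
    "expr := expr + term",
    "expr := term",
    "term := IDENTIFIER",
};

// Toy replacement for clang::pseudo::Grammar: it only stores the rule text.
struct ToyGrammar {
  std::vector<std::string> Rules;
};

// Hypothetical runtime "compile" step, standing in for Grammar::parseBNF.
static ToyGrammar compile(const char *const *Lines, std::size_t N) {
  assert(Lines && "grammar text must be embedded");
  ToyGrammar G;
  for (std::size_t I = 0; I < N; ++I)
    G.Rules.emplace_back(Lines[I]);
  return G;
}

// Same shape as the generated accessor, but built lazily from the embedded
// strings. The function-local static is initialized exactly once.
const ToyGrammar &getGrammar() {
  static const ToyGrammar G =
      compile(GrammarLines, sizeof(GrammarLines) / sizeof(GrammarLines[0]));
  return G;
}
```

Callers would not need to change if individual pieces (for example the LR table) later move to build time, since only the body of the accessor differs.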

Build performance
Compile time/ram is my biggest technical concern. We know that there's no good portable way to get blobs of data into the build system, and the parser is the bottleneck. (See the sad std::embed situation). 300KB should be manageable, with some work.
There are things we can do to mitigate:

- for the big lists, use simple top-level declarations of arrays with types, not nested expressions ending up as std::initializer_lists. Just parsing the elements is going to be slow, but there's no reason to involve overload resolution, force them to all be in memory at once, etc.
- split each big list into its own file for build-system level parallelism. In particular actions[] is ~half the data. This is easy once it's a top-level array.
- not parse-time, but related: switch the classes to holding ArrayRefs rather than vectors so we don't have to copy everything. (vector{1,2,3} is always a copy, it can't share the underlying static because it's mutable).
- make the element types simpler to parse. Currently the Actions buffer is a vector<class Action> which at the bottom is two numbers packed into 16 bits. This could instead be a uint16_t Actions[]={...};, with LRTable wrapping the uint16_ts into Action objects on demand (at zero runtime cost).
- if we just have an array of 16-bit integers, it's possible making that source file C would be faster to parse than C++ (fewer language features). I doubt it but...
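The flat-array idea from the list above can be sketched as follows. This is an illustration, not the LRTable implementation: the 3-bit kind / 13-bit value split, the `Kind` numbering, the `ActionTable` view and the array contents are all assumptions made for the example. The generated file would contain nothing but integer literals, and the hand-written side wraps each element into an `Action` only when it is read.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed layout: kind in the top 3 bits, value in the low 13 bits.
enum class Kind : uint16_t { Shift = 1, Reduce = 2, Accept = 3, GoTo = 4 };
constexpr unsigned ValueBits = 13;
constexpr uint16_t ValueMask = (1u << ValueBits) - 1;

// Packs a kind and a value into one 16-bit element (generator side).
constexpr uint16_t pack(Kind K, uint16_t Value) {
  return static_cast<uint16_t>((static_cast<unsigned>(K) << ValueBits) |
                               (Value & ValueMask));
}

// Thin wrapper created on demand; it holds just the two packed bytes.
class Action {
  uint16_t Bits;

public:
  constexpr explicit Action(uint16_t Bits) : Bits(Bits) {}
  constexpr Kind kind() const { return static_cast<Kind>(Bits >> ValueBits); }
  constexpr uint16_t value() const {
    return static_cast<uint16_t>(Bits & ValueMask);
  }
};

// What a generated file could contain: a top-level array of plain integers,
// with no constructors or initializer lists to resolve. Values are made up:
// shift(5), reduce(3), accept, goTo(1).
const uint16_t Actions[] = {0x2005, 0x4003, 0x6000, 0x8001};

// Non-owning view, playing the role ArrayRef would play in LLVM code.
struct ActionTable {
  const uint16_t *Data;
  std::size_t Size;
  Action operator[](std::size_t I) const {
    assert(I < Size && "action index out of range");
    return Action(Data[I]);
  }
};

const ActionTable Table{Actions, sizeof(Actions) / sizeof(Actions[0])};
```

Because the view never copies the array, the static data can live in read-only storage and be shared, which a `std::vector` member cannot do.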

Layering

- the generator shouldn't depend on much, to avoid creating long paths in the build graph. It's pretty good now. I may be missing something, but does it need to link against clangBasic? If this is for functions like getTokenName() we could consider reimplementing them. (I think clangPseudoGrammar may be a better name than clangPseudoBasic).
- similarly, the generated code should avoid dependencies, particularly if it's slow to compile - maybe it doesn't need clangBasic either? (For our purposes it'd also be nice to avoid depending on any tablegenned headers).

Scope/generality
I think we should plan to run this generator in exactly one configuration (i.e. for our C++ grammar); we shouldn't pay the costs of making it efficient to reuse.
This means we don't need to generate the whole library with complete API surface, just the bits that really need to be generated. Generated code is harder to maintain and browse.

- in the header, this is just the symbol list (and later the rule list). The rest of the header can be hand-written.
- we'd want a cpp file for each array (maybe group some together), but none of the initialization logic that uses them: we can just forward declare and use those arrays from hand-written C++
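One way to keep the generated part down to the symbol list is an X-macro, sketched below. Everything here is an assumption made for illustration: the macro name `CXX_NONTERMINALS`, the three symbols and their IDs, and the `symbolName` helper are hypothetical, and in a real build the macro body would live in a separate generated file rather than inline. The enum and the lookup function are the hand-written part.

```cpp
#include <cstdint>
#include <string_view>

using SymbolID = uint16_t;

// --- generated part (hypothetical contents) ---
// Only this list would come out of the generator.
#define CXX_NONTERMINALS(X)                                                    \
  X(translation_unit, 0)                                                       \
  X(declaration_seq, 1)                                                        \
  X(statement, 2)

// --- hand-written part ---
namespace cxx {

// The enum is expanded from the generated list.
enum Symbol : SymbolID {
#define X(Name, ID) Name = ID,
  CXX_NONTERMINALS(X)
#undef X
};

// Maps an ID back to the enumerator spelling, reusing the same list.
constexpr std::string_view symbolName(SymbolID ID) {
  switch (ID) {
#define X(Name, Id)                                                            \
  case Id:                                                                     \
    return #Name;
    CXX_NONTERMINALS(X)
#undef X
  }
  return "<unknown>";
}

static_assert(statement == 2, "enumerators take the generated IDs");

} // namespace cxx
```

With this split, changing the surrounding API means editing ordinary source files, and the generator output stays small enough to read.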

Maybe later we want to be able to compile a different grammar for C and we revisit this (but I think equally likely we work out how to express language options in terms of dynamically-disabling grammar rules instead).

hokein mentioned this in D125667: [pseudo] A basic implementation of compiling cxx grammar at build time. May 16 2022, 1:05 AM

Revision Contents

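Path                                                      Size
clang-tools-extra/pseudo/CMakeLists.txt                   3 lines
clang-tools-extra/pseudo/gen/CMakeLists.txt               10 lines
clang-tools-extra/pseudo/gen/Cxx.cmake                    21 lines
clang-tools-extra/pseudo/gen/CxxGen.cpp                   230 lines
clang-tools-extra/pseudo/include/clang-pseudo/Grammar.h   7 lines
clang-tools-extra/pseudo/include/clang-pseudo/LRTable.h   1 line
clang-tools-extra/pseudo/lib/CMakeLists.txt               33 lines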

Diff 428079

clang-tools-extra/pseudo/CMakeLists.txt

				set(CLANG_PSEUDO_BINARY_DIR ${CMAKE_CURRENT_BINARY_DIR})

	include_directories(include)			include_directories(include)
	include_directories(${CMAKE_CURRENT_BINARY_DIR}/include)			include_directories(${CMAKE_CURRENT_BINARY_DIR}/include)
	add_subdirectory(lib)			add_subdirectory(lib)
	add_subdirectory(tool)			add_subdirectory(tool)
	add_subdirectory(fuzzer)			add_subdirectory(fuzzer)
	add_subdirectory(benchmarks)			add_subdirectory(benchmarks)
				add_subdirectory(gen)
	if(CLANG_INCLUDE_TESTS)			if(CLANG_INCLUDE_TESTS)
	add_subdirectory(unittests)			add_subdirectory(unittests)
	add_subdirectory(test)			add_subdirectory(test)
	endif()			endif()

clang-tools-extra/pseudo/gen/CMakeLists.txt

This file was added.

				set(LLVM_LINK_COMPONENTS Support)

				add_clang_executable(pseudo-cxx-gen
				CxxGen.cpp
				)

				target_link_libraries(pseudo-cxx-gen
				PRIVATE
				clangPseudoBasic
				)

clang-tools-extra/pseudo/gen/Cxx.cmake

This file was added.

				# Compiles the BNF grammar file, and produces a pair of files called
				# ${filename}.h and ${filename}.cpp in the ${CLANG_PSEUDO_BINARY_DIR}.
				function(gen_cxx grammar_file filename)
				set(header_file ${CLANG_PSEUDO_BINARY_DIR}/${filename}.h)
				set(cpp_file ${CLANG_PSEUDO_BINARY_DIR}/${filename}.cpp)

				add_custom_command(OUTPUT ${header_file} ${cpp_file}
				COMMAND "${CMAKE_RUNTIME_OUTPUT_DIRECTORY}/pseudo-cxx-gen"
				--grammar ${grammar_file}
				--output-dir ${CLANG_PSEUDO_BINARY_DIR}
				--filename ${filename}
				COMMENT "Generating code for cxx grammar..."
				DEPENDS pseudo-cxx-gen
				VERBATIM)

				set_source_files_properties(${header_file} PROPERTIES
				GENERATED 1)
				set_source_files_properties(${cpp_file} PROPERTIES
				GENERATED 1)

				endfunction()

clang-tools-extra/pseudo/gen/CxxGen.cpp

This file was added.

				//===-- CxxGen.cpp - Compile BNF grammar and LR table ---------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "clang-pseudo/Grammar.h"
				#include "clang-pseudo/LRGraph.h"
				#include "clang-pseudo/LRTable.h"
				#include "llvm/ADT/StringExtras.h"
				#include "llvm/Support/CommandLine.h"
				#include "llvm/Support/FormatVariadic.h"
				#include "llvm/Support/MemoryBuffer.h"
				#include <algorithm>

				using clang::pseudo::Grammar;
				using llvm::cl::desc;
				using llvm::cl::init;
				using llvm::cl::opt;

				static opt<std::string>
				Grammar("grammar", desc("Parse and check a BNF grammar file."), init(""));
				static opt<std::string>
				Filename("filename", desc("Output file name (without file extension)"),
				init("Cxx"));
				static opt<std::string> OutputDir("output-dir", desc("Output directory"),
				init(""));

				static std::string readOrDie(llvm::StringRef Path) {
				llvm::ErrorOr<std::unique_ptr<llvm::MemoryBuffer>> Text =
				llvm::MemoryBuffer::getFile(Path);
				if (std::error_code EC = Text.getError()) {
				llvm::errs() << "Error: can't read grammar file '" << Path
				<< "': " << EC.message() << "\n";
				::exit(1);
				}
				return Text.get()->getBuffer().str();
				}

				static std::string genHeaderCode(const clang::pseudo::Grammar &G,
				llvm::StringRef Filename) {
				std::vector<std::string> NonterminalEnums;
				NonterminalEnums.reserve(G.table().Nonterminals.size());
				for (clang::pseudo::SymbolID ID = 0; ID < G.table().Nonterminals.size();
				++ID) {
				std::string Name = G.symbolName(ID).str();
				// translation-unit -> translation_unit
				std::replace(Name.begin(), Name.end(), '-', '_');
				NonterminalEnums.push_back(llvm::formatv(" {0} = {1}", Name, ID));
				}
				std::string HeaderGuard =
				llvm::formatv("GENERATED_CLANG_PSEUDO_{0}_H", Filename);
				return llvm::formatv(R"cpp(
				#ifndef {0}
				#define {0}

				#include "clang-pseudo/Grammar.h"
				#include "llvm/Support/Compiler.h"

				namespace clang {
				namespace pseudo {
				class LRTable;
				namespace cxx {

				enum Symbol : SymbolID {
				{1}
				};

				const Grammar& getGrammar();
				const LRTable& getLRTable();

				} // namespace cxx
				} // namespace pseudo
				} // namespace clang

				#endif // {0})cpp",
				HeaderGuard, llvm::join(NonterminalEnums, ",\n"));
				}

				template <typename Container>
				std::string genericJoin(const Container &C, llvm::StringRef Separator) {
				std::vector<std::string> Strings;
				for (const auto &E : C)
				Strings.push_back(llvm::formatv("{0}", E));
				return llvm::join(Strings, Separator);
				}

				static std::string genCppCode(const clang::pseudo::Grammar &G,
				llvm::StringRef Filename) {
				auto ToNames = [&](llvm::ArrayRef<clang::pseudo::SymbolID> Syms) {
				std::vector<std::string> Names;
				for (auto SID : Syms)
				Names.push_back(llvm::formatv("/*{0}*/{1}", G.symbolName(SID), SID));
				return Names;
				};
				std::vector<std::string> Rules;
				for (const auto &R : G.table().Rules) {
				Rules.push_back(llvm::formatv(" { /*{0}*/{1}, /*Seq=*/{ {2} } }",
				G.symbolName(R.Target), R.Target,
				llvm::join(ToNames(R.seq()), ", ")));
				}

				std::vector<std::string> Nonterminals;
				for (const auto &NT : G.table().Nonterminals) {
				Nonterminals.push_back(
				llvm::formatv(" { \"{0}\", {/*Start*/{1}, /*End*/{2} } }", NT.Name,
				NT.RuleRange.Start, NT.RuleRange.End));
				}
				std::vector<std::string> Terminals;
				for (const auto &T : G.table().Terminals) {
				Terminals.push_back(llvm::formatv(" \"{0}\"", T));
				}

				auto LRTable = clang::pseudo::LRTable::buildSLR(G);

				std::string LRNontermOffset = genericJoin(LRTable.NontermOffset, ", ");
				std::string LRTermOffsetCode = genericJoin(LRTable.TerminalOffset, ", ");
				std::string LRStates = genericJoin(LRTable.States, ", ");
				std::vector<std::string> LRActions;
				for (const auto &Action : LRTable.Actions) {
				switch (Action.kind()) {
				case clang::pseudo::LRTable::Action::Shift:
				LRActions.push_back(
				llvm::formatv("Action::shift({0})", Action.getShiftState()));
				break;
				case clang::pseudo::LRTable::Action::Reduce:
				LRActions.push_back(
				llvm::formatv("Action::reduce({0})", Action.getReduceRule()));
				break;
				case clang::pseudo::LRTable::Action::Accept:
				// FIXME: use a real RID here
				LRActions.push_back(llvm::formatv("Action::accept(0)"));
				break;
				case clang::pseudo::LRTable::Action::GoTo:
				LRActions.push_back(
				llvm::formatv("Action::goTo({0})", Action.getGoToState()));
				break;
				default:
				assert(false);
				break;
				}
				}
				std::vector<std::string> LRStartStates;
				for (const auto &SA : LRTable.StartStates) {
				LRStartStates.push_back(llvm::formatv(
				"{ /*SymbolID*/{0}, /*StartState*/{1} }", SA.first, SA.second));
				}
				return llvm::formatv(
				R"cpp(#include <memory>

				#include "{0}.h"
				#include "clang-pseudo/Grammar.h"
				#include "clang-pseudo/LRTable.h"

				namespace clang {
				namespace pseudo {

				namespace cxx {

				const Grammar& getGrammar() {
				static GrammarTable* Table = new GrammarTable({
				{ // Rules
				{1}
				}, // Rules
				{ // Nonterminals
				{2}
				}, // Nonterminals
				{ // Terminals
				{3}
				} // Terminals
				});
				static Grammar* G = new Grammar(std::unique_ptr<GrammarTable>(Table));
				return *G;
				}

				const LRTable& getLRTable() {
				using Action = LRTable::Action;
				static LRTable* Table = new LRTable({
				/*NontermOffset=*/{ {4} },
				/*TermOffset=*/{ {5} },
				/*States=*/{ {6} },
				/*Actions=*/{ {7} },
				/*StartStates=*/{ {8} },
				});
				return *Table;
				}

				} // namespace cxx
				} // namespace pseudo
				} // namespace clang
				)cpp",
				Filename, llvm::join(Rules, ",\n"), llvm::join(Nonterminals, ", "),
				llvm::join(Terminals, ", "), LRNontermOffset, LRTermOffsetCode, LRStates,
				llvm::join(LRActions, ", "), llvm::join(LRStartStates, ", "));
				}

				void writeFile(llvm::StringRef Filepath, llvm::StringRef Content) {
				std::error_code EC;
				llvm::raw_fd_ostream FD(llvm::StringRef(Filepath), EC);
				if (EC) {
				llvm::errs() << "Failed to open file: " << Filepath << ": " << EC.message() << "\n";
				exit(1);
				}
				FD << Content;
				}

				int main(int argc, char *argv[]) {
				llvm::cl::ParseCommandLineOptions(argc, argv, "");
				if (!Grammar.getNumOccurrences()) {
				llvm::errs() << "Grammar file must be provided!\n";
				return 1;
				}

				std::string GrammarText = readOrDie(Grammar);
				std::vector<std::string> Diags;
				auto G = Grammar::parseBNF(GrammarText, Diags);

				if (!Diags.empty()) {
				llvm::errs() << llvm::join(Diags, "\n");
				return 1;
				}

				std::string HeaderPath = llvm::formatv("{0}/{1}.h", OutputDir, Filename);
				std::string CppPath = llvm::formatv("{0}/{1}.cpp", OutputDir, Filename);
				writeFile(HeaderPath, genHeaderCode(*G, Filename));
				writeFile(CppPath, genCppCode(*G, Filename));
				return 0;
				}

clang-tools-extra/pseudo/include/clang-pseudo/Grammar.h

	... (147 lines not shown)
	// For each nonterminal X, computes the set of terminals that could immediately			// For each nonterminal X, computes the set of terminals that could immediately
	// follow X. (Known as FOLLOW sets in grammar-based parsers).			// follow X. (Known as FOLLOW sets in grammar-based parsers).
	std::vector<llvm::DenseSet<SymbolID>> followSets(const Grammar &);			std::vector<llvm::DenseSet<SymbolID>> followSets(const Grammar &);

	// Storage for the underlying data of the Grammar.			// Storage for the underlying data of the Grammar.
	// It can be constructed dynamically (from compiling BNF file) or statically			// It can be constructed dynamically (from compiling BNF file) or statically
	// (a compiled data-source).			// (a compiled data-source).
	struct GrammarTable {			struct GrammarTable {
	GrammarTable();

	struct Nonterminal {			struct Nonterminal {
	std::string Name;			std::string Name;
	// Corresponding rules that construct the nonterminal, it is a [Start, End)			// Corresponding rules that construct the nonterminal, it is a [Start, End)
	// index range of the Rules table.			// index range of the Rules table.
	struct {			struct {
	RuleID Start;			RuleID Start;
	RuleID End;			RuleID End;
	} RuleRange;			} RuleRange;
	};			};
				GrammarTable();
				GrammarTable(std::vector<Rule> Rules, std::vector<Nonterminal> Nonterminals,
				llvm::ArrayRef<std::string> Terminals)
				: Rules(std::move(Rules)), Terminals(Terminals),
				Nonterminals(std::move(Nonterminals)){};

	// RuleID is an index into this table of rule definitions.			// RuleID is an index into this table of rule definitions.
	//			//
	// Rules with the same target symbol (LHS) are grouped into a single range.			// Rules with the same target symbol (LHS) are grouped into a single range.
	// The relative order of different target symbols is not by SymbolID, but			// The relative order of different target symbols is not by SymbolID, but
	// rather a topological sort: if S := T then the rules producing T have lower			// rather a topological sort: if S := T then the rules producing T have lower
	// RuleIDs than rules producing S.			// RuleIDs than rules producing S.
	// (This strange order simplifies the GLR parser: for a given token range, if			// (This strange order simplifies the GLR parser: for a given token range, if
	... (15 lines not shown)

clang-tools-extra/pseudo/include/clang-pseudo/LRTable.h

... (159 lines not shown)	public:
struct Entry {		struct Entry {
StateID State;		StateID State;
SymbolID Symbol;		SymbolID Symbol;
Action Act;		Action Act;
};		};
// Build a specifid table for testing purposes.		// Build a specifid table for testing purposes.
static LRTable buildForTests(const GrammarTable &, llvm::ArrayRef<Entry>);		static LRTable buildForTests(const GrammarTable &, llvm::ArrayRef<Entry>);

private:
// Conceptually the LR table is a multimap from (State, SymbolID) => Action.		// Conceptually the LR table is a multimap from (State, SymbolID) => Action.
// Our physical representation is quite different for compactness.		// Our physical representation is quite different for compactness.

// Index is nonterminal SymbolID, value is the offset into States/Actions		// Index is nonterminal SymbolID, value is the offset into States/Actions
// where the entries for this nonterminal begin.		// where the entries for this nonterminal begin.
// Give a nonterminal id, the corresponding half-open range of StateIdx is		// Give a nonterminal id, the corresponding half-open range of StateIdx is
// [NontermIdx[id], NontermIdx[id+1]).		// [NontermIdx[id], NontermIdx[id+1]).
std::vector<uint32_t> NontermOffset;		std::vector<uint32_t> NontermOffset;
... (17 lines not shown)

clang-tools-extra/pseudo/lib/CMakeLists.txt

	set(LLVM_LINK_COMPONENTS Support)			set(LLVM_LINK_COMPONENTS Support)

	add_clang_library(clangPseudo			include(${CMAKE_CURRENT_SOURCE_DIR}/../gen/Cxx.cmake)
				set(CXX_GRAMMAR ${CMAKE_CURRENT_LIST_DIR}/cxx.bnf)
				gen_cxx(${CXX_GRAMMAR} "Cxx")

				# Needed by LLVM's CMake checks because this file defines multiple targets.
				set(LLVM_OPTIONAL_SOURCES
	DirectiveTree.cpp			DirectiveTree.cpp
	Forest.cpp			Forest.cpp
	GLR.cpp			GLR.cpp
	Grammar.cpp			Grammar.cpp
	GrammarBNF.cpp			GrammarBNF.cpp
	Lex.cpp			Lex.cpp
	LRGraph.cpp			LRGraph.cpp
	LRTable.cpp			LRTable.cpp
	LRTableBuild.cpp			LRTableBuild.cpp
	Token.cpp			Token.cpp
				)

				add_clang_library(clangPseudoBasic
				Grammar.cpp
				GrammarBNF.cpp
				LRGraph.cpp
				LRTable.cpp
				LRTableBuild.cpp

				LINK_LIBS
				clangBasic
				)

				add_clang_library(clangPseudo
				DirectiveTree.cpp
				Forest.cpp
				GLR.cpp
				Lex.cpp
				Token.cpp

	LINK_LIBS			LINK_LIBS
	clangBasic			clangBasic
	clangLex			clangLex
				clangPseudoBasic
				)

				add_clang_library(clangPseudoCXX
				${CLANG_PSEUDO_BINARY_DIR}/Cxx.cpp
				LINK_LIBS
				clangBasic
	)			)
