This is an archive of the discontinued LLVM Phabricator instance.

Differential D126536

[pseudo] Add grammar annotations support.
ClosedPublic

Authored by hokein on May 27 2022, 6:28 AM.

Details

Reviewers

sammccall

Commits

rGf1ac00c9b0d1: [pseudo] Add grammar annotations support.

Summary

Add annotation handling ([key=value]) to the BNF grammar parser;
Define and set up the API in the grammar for attributes;
Implement a builtin guard for two simple C++ use cases (contextual-override/final);
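
As a rough illustration of the first item, the sketch below splits a bracketed `[key=value]` annotation into its two parts using only the standard library. `Annotation` and `parseAnnotation` are hypothetical names for this example and not the patch's API; the real parser in GrammarBNF.cpp works on LLVM string types and is not reproduced here.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical holder for one parsed annotation; not the patch's type.
struct Annotation {
  std::string Key;
  std::string Value;
};

// Parses text of the form "[key=value]". Returns std::nullopt when the text
// is not bracketed or has no '=' separator.
std::optional<Annotation> parseAnnotation(std::string_view Text) {
  if (Text.size() < 2 || Text.front() != '[' || Text.back() != ']')
    return std::nullopt;
  Text.remove_prefix(1); // drop '['
  Text.remove_suffix(1); // drop ']'
  const auto Eq = Text.find('=');
  if (Eq == std::string_view::npos)
    return std::nullopt;
  return Annotation{std::string(Text.substr(0, Eq)),
                    std::string(Text.substr(Eq + 1))};
}
```

Given `[guard=Override]`, this yields the key `guard` and the value `Override`; the strict requirement of an `=` reflects only the `[key=value]` form named in the summary.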

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

hokein created this revision. May 27 2022, 6:28 AM

Herald added a project: Restricted Project. May 27 2022, 6:28 AM

Herald added subscribers: mgrang, mgorny.

hokein requested review of this revision. May 27 2022, 6:28 AM

Herald added a subscriber: alextsao1999.

This patch might cover too many things (at least we could split out the guard/GLR implementation bit), but I want to give you an overview of the whole picture first.

hokein added inline comments. May 27 2022, 6:40 AM

clang-tools-extra/pseudo/include/clang-pseudo/Grammar.h
88	New names are welcome. Attribute is the name I came up with (I think it is clearer than the original `Hook`).
95	I'm not quite happy with using the value as the ID. I think we can encode the Key into the ID as well (ID := Key | Value). The same goes for the generated enum name: currently we just use the name of the Value (`Override`), which will be more confusing when we add more keys/values. One idea is to add the key as well (`GuardOverride` etc?).

Harbormaster completed remote builds in B166634: Diff 432538. May 27 2022, 6:50 AM

Nice!

This has tests for the parsing-the-attribute bits, but I think we're missing tests for the actual guards added.

clang-tools-extra/pseudo/include/clang-pseudo/GLR.h
114 ↗	(On Diff #432538)	this signature seems a little off to me. Guard logic is identified by a guard ID and we look it up, but we're not passing in the guard ID for some reason. Instead we pass in the rule ID, which this function uses to look up the guard ID again. Why not just pass the guard ID?
That said, it's a bit surprising that we have the rules declare separate guard rules, but then they're all implemented by one function. A map of functions seems more natural. (but not a performance advantage I guess)
Naming: it's confusing that this umbrella function is called "guard", but a specific rule like "guard=Override" is also called "guard". I'd either call this a GuardTester or bypass the issue by making this a DenseMap whose values are Guards.
Altogether I might write this as:
  using Guard = llvm::function_ref<bool(llvm::ArrayRef<const ForestNode *> RHS, const TokenStream &, const Grammar &)>;
  ...
  const DenseMap<AttributeID, Guard> &Guards;
129 ↗	(On Diff #432538)	either a reference, or a lightweight reference type like function_ref, for consistency with other fields?
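
The map-of-guards shape suggested in the GLR.h comments could look roughly like the standalone sketch below. It rests on several assumptions: `std::function` and `std::unordered_map` stand in for `llvm::function_ref` and `llvm::DenseMap`, a guard receives plain token spellings instead of forest nodes and a token stream, and `GuardID`, `GuardMap`, `makeGuards`, `canReduce`, and the `Override` constant are names invented for the example.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

using GuardID = std::uint16_t; // 0 is reserved for "no guard".

// A guard inspects the right-hand side of a rule and decides whether the
// reduction may proceed. Token spellings stand in for forest nodes here.
using Guard = std::function<bool(const std::vector<std::string> &RHS)>;
using GuardMap = std::unordered_map<GuardID, Guard>;

constexpr GuardID Override = 1; // Hypothetical ID for [guard=Override].

// Builds a table with a single guard for the contextual keyword `override`.
GuardMap makeGuards() {
  GuardMap Guards;
  Guards[Override] = [](const std::vector<std::string> &RHS) {
    return RHS.size() == 1 && RHS.front() == "override";
  };
  return Guards;
}

// The caller passes the guard ID directly, so no rule lookup is needed.
// Unguarded rules (ID 0) always reduce; in this sketch an ID with no
// registered guard also reduces.
bool canReduce(const GuardMap &Guards, GuardID ID,
               const std::vector<std::string> &RHS) {
  if (ID == 0)
    return true;
  auto It = Guards.find(ID);
  return It == Guards.end() || It->second(RHS);
}
```

Keying the table by guard ID keeps each check in its own small function, which is the readability point made in the comment; whether it performs any differently from a single switch is not claimed here.
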
clang-tools-extra/pseudo/include/clang-pseudo/Grammar.h
22	nit: first "guard" should be "key"?
88	I don't understand why we need AttributeKey, and to parse things into it, rather than just using StringRef. (Also here you've put it in the header, but it's not used in any interfaces)
95	Packing all the possible values into a single unstructured enum is weird, but the problem (if it's a problem) is _redundancy_ and packing the attribute name in there as well just makes that worse. If we want to fix this, I think a sounder fix is something like:
  // Grammar.h
  using AttributeEnum = uint8_t;
  struct Rule { ...; AttributeEnum Guard = 0; };
  // CXX.h
  enum class GuardID : AttributeEnum { Override = 1 };
  enum class RecoveryID : AttributeEnum { Parens = 1 };
i.e. we keep track of the unique values per attribute name and emit an enum for each attribute. This probably means emitting the actual enum definitions in tablegen, rather than a generic .inc file. (But I'm also fine with not fixing it at this point, and keeping one flat enum for the values)
96	I agree HookID is an unfortunately vague name, but I did consider and reject AttributeID which I think is actively confusing. In the plain-english meaning, "guard" is an attribute, and "override" is its value. So assigning "override" an "attribute ID" is very confusing. The difficulty of naming this is that we can't refer to the semantics (e.g. "Guard") as we want a single namespace for all of them, and all they have in common is syntax. However referring to "attribute values" makes for confusing internal data structures, as we don't model the whole attribute generically. So we're left with names that suggest "this is some kind of anchor to attach custom code to". Maybe `ExtensionID` is a little clearer than `HookID`?
98	If we're going to use implicit bool conversions, we should not pretend this is some arbitrary value and drop the constant I think.
102	if this is optional and rarely set, I don't think it should be a constructor parameter - this constructor could become unwieldy. It also forces you to process the attributes before creating the Rule, and reading the code doing it afterwards seems more natural.
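
To make the "one enum per attribute" idea from the Grammar.h comments concrete, here is a minimal sketch. `RuleAttributes`, `hasGuard`, `guardIs`, and the `Final` enumerator are assumptions added for illustration; only `AttributeEnum`, `GuardID::Override`, and `RecoveryID::Parens` appear in the comment itself.

```cpp
#include <cstdint>

using AttributeEnum = std::uint8_t;

// Each attribute gets its own enum; 0 is left unused so it can mean "unset".
// `Final` is a hypothetical second guard, added only for this example.
enum class GuardID : AttributeEnum { Override = 1, Final = 2 };
enum class RecoveryID : AttributeEnum { Parens = 1 };

// Simplified stand-in for the grammar's Rule: only the attribute fields.
struct RuleAttributes {
  AttributeEnum Guard = 0;    // 0 means no guard.
  AttributeEnum Recovery = 0; // 0 means no recovery strategy.

  bool hasGuard() const { return Guard != 0; }
  bool guardIs(GuardID ID) const {
    return Guard == static_cast<AttributeEnum>(ID);
  }
};
```

Because each attribute has its own enum, `GuardID::Override` and `RecoveryID::Parens` can both use the value 1 without clashing, which is the redundancy the flat enum cannot avoid.
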
clang-tools-extra/pseudo/include/clang-pseudo/cxx/CXX.h
39 ↗	(On Diff #432538)	this pattern is fragile. (and we're going to end up adding rules, too) I'd suggest either:
- generate a separate file for each enum
- generate a file containing the full definition of all the enums (we're not using the generality)
- add the fallback #defines to the .inc file so the caller doesn't have to provide them all
63 ↗	(On Diff #432538)	"guard" looks like a verb here. If we're going to use it as both noun and verb, the relationship between the two needs to match the metaphor (in this case, it needs to be the guard who is doing the guarding) and it's not clear that it is. I think safest to stick to `const Guard &getGuard()` or so.
clang-tools-extra/pseudo/lib/grammar/GrammarBNF.cpp
47	This patch takes this function from 80->120 lines, and it's starting to feel unwieldy. I think moving building the UniqueNonterminals/SymbolIDs + the new maps into a function (which returns a struct of SymbolIds and AttrValueIDs) would make this more manageable.
51	unused
52	why ""?
60	AttrValueIDs never seems to get populated, so I'm not sure this works.
104	This doesn't seem worth diagnosing to me - it seems stylistically weird but not really a problem to put a guard wherever.
104	maybe another function to pull out here? applyAttribute(StringRef Name, StringRef Value, AttributeID ValueID, Rule&, Element&)
227	Again, I don't think we should be aiming to provide nice diagnostics for each possible way we could get this wrong - the grammar here is part of the pseudo-parser, not user input.
258	(This seems like a condition we can diagnose when applying the attribute, instead of needing to do it eagerly here)
clang-tools-extra/pseudo/unittests/GLRTest.cpp
161 ↗	(On Diff #432538)	These empty streams aren't valid (in multiple ways: they're not finalized, and they don't contain the tokens that the nodes refer to). Maybe add a tokenStreamForTest(StringLiteral) function somewhere? Safe because guaranteed to be null-terminated and have sufficient lifetime...

address the grammar part comments:

Rename to ExtensionID;
Simplify the BNF parsing bit;

I addressed the comments on the grammar part (the remaining GLR and pseudo_gen parts are not covered yet). I think it would be better to split them into different patches, so that we can land the grammar bit first, then start doing the error recovery and guard implementation in parallel.

clang-tools-extra/pseudo/include/clang-pseudo/Grammar.h
102	I suppose you mean we should drop the Guard parameter in the constructor, yeah, that sounds good to me, and simplifies the BNF parsing bit.
clang-tools-extra/pseudo/lib/grammar/GrammarBNF.cpp
52	This was for the sentinel default 0 attribute value.
60	oops, this part of code wasn't fully cleaned up. AttrValueIDs is not needed indeed.
104	sounds good, moved the extension bits to a dedicated `applyExtension`.
104	"This doesn't seem worth diagnosing to me - it seems stylistically weird but not really a problem to put a guard wherever." It is valid at the syntax level, but not at the semantic level. I think this is confusing: the bnf parser doesn't emit any warning on this usage and just silently ignores it (a grammar writer might think the grammar is correct, and that the guard should work even when placed in the middle of a rule). On the other hand, as you mentioned, the grammar file is not user-facing, and we're the current authors, so it seems ok to not do it (in favour of simplicity) right now (we might need to reconsider improving it if there are new grammar contributors).
227	fair enough, but we should keep minimal critical diagnostics at least to avoid shooting ourselves in the foot.

remove a left-out change.

Harbormaster completed remote builds in B167232: Diff 433354. Jun 1 2022, 4:54 AM

Thanks, this looks better.
Sorry about confusion with the names - I think annotation is a great name, and we should only use "extension" for the narrower "opaque thing that an annotation value refers to".
Happy to chat about this offline if you like

clang-tools-extra/pseudo/include/clang-pseudo/Grammar.h
22	Hmm, I think we misunderstood each other... I think annotations is a good name for the syntactic [foo=bar] bit. I think the comment here is now harder to follow, and would prefer the old version. It's specifically "bar" which indicates some identifier whose semantics will be provided externally, so "bar" is a reference to an extension. If you want to mention the concept of an extension, then maybe at the end... `Each unique annotation value creates an extension point. The Override guard is implemented in CXX.cpp by binding the ExtensionID for Override to a C++ function that performs the check.`
86	"an extension uniquely identifies an extension" is a tautology. An extension is a piece of native code specific to a grammar that modifies the behavior of annotated rules. One ExtensionID is assigned for each unique attribute value (all attributes share a namespace).
88	I don't think this last sentence is useful as-is, it looks important but it's not clear who it constrains. Also "attribute" rather than "key", an attribute is a thing that objects have, a key is a thing maps have.
213	attribute values, or extension names (attributes are syntactic and have string values, extensions are semantic and can be referred to by names).
clang-tools-extra/pseudo/lib/grammar/GrammarBNF.cpp
105	doing this outside the loop above means this function has to handle all specs/rules and work out how they correspond, which is fragile. Instead can we do it inside the loop, and just handle a single attribute at a time?
  for (const auto &Spec : Specs) {
    // ... create rule
    for (const auto &Attribute : Spec.Attributes)
      applyAttribute(Attribute, T->Rules.back());
  }
184	rather attributes :-(
243	For simplicity we could also drop this loop. If we ever need to specify two attributes on the same token, `a := b [foo] [bar]` works fine
246	If we drop this line, then [foo] will be equivalent to [foo=], i.e. attribute is present and points at the empty string. This seems reasonable to me, and is a good syntax for boolean attributes (where [foo] denotes set and no attribute denotes unset)
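
The relaxed syntax discussed in the last comment, where `[foo]` behaves like `[foo=]`, can be sketched with a small splitter. `splitAttribute` is a hypothetical helper written against the standard library; `llvm::StringRef::split` behaves similarly in that it returns an empty second half when the separator is absent.

```cpp
#include <string>
#include <string_view>
#include <utility>

// Splits the text between '[' and ']' into (name, value). A missing '='
// yields an empty value, so "foo" and "foo=" parse identically.
std::pair<std::string, std::string> splitAttribute(std::string_view Body) {
  const auto Eq = Body.find('=');
  if (Eq == std::string_view::npos)
    return {std::string(Body), std::string()};
  return {std::string(Body.substr(0, Eq)), std::string(Body.substr(Eq + 1))};
}
```

With this shape a boolean attribute needs no special case: its presence is the signal and its value is simply empty.
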
clang-tools-extra/pseudo/unittests/GrammarTest.cpp
104	if you like you could add 3 more rules, with no annotation, guard=override, guard=somethingelse and verify they're zero, equal, nonequal respectively
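
The zero / equal / non-equal property suggested for the test follows from how IDs are handed out. The sketch below assumes IDs are assigned in sorted order over the distinct attribute values with the empty string reserved as the sentinel; `assignExtensionIDs` is a hypothetical helper, not the function in GrammarBNF.cpp.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

using ExtensionID = std::uint16_t;

// Assigns one ID per distinct attribute value. The empty string is always
// present and sorts first, so "no attribute" maps to the sentinel ID 0.
std::map<std::string, ExtensionID>
assignExtensionIDs(const std::vector<std::string> &Values) {
  std::set<std::string> Unique(Values.begin(), Values.end());
  Unique.insert("");
  std::map<std::string, ExtensionID> IDs;
  ExtensionID Next = 0;
  for (const std::string &V : Unique)
    IDs[V] = Next++;
  return IDs;
}
```

Two rules annotated with the same value share an ID, a rule with a different value gets a different ID, and an unannotated rule keeps 0.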

bring back the attribute concept, narrow down the ExtensionID scope (only used for semantics);
loosen and simplify the BNF annotation parsing ([] only allows a single attribute; attributes without a value are accepted);
address other comments;

hokein added inline comments. Jun 7 2022, 11:58 AM

clang-tools-extra/pseudo/include/clang-pseudo/Grammar.h
22	oops, I misunderstood your "extensionID" comment completely (sorry). Bring it back now.
213	renamed to attribute values.
clang-tools-extra/pseudo/lib/grammar/GrammarBNF.cpp
246	This syntax for bool-type attributes seems good to me. Note that there is a subtle difference from the string-type attributes, where we use the empty string as the sentinel attribute value (denoting unset) whose ExtensionID is 0. I think it should not be a big issue (we only handle them at BNF parsing time, and propagate the field values into `Rule`).

Harbormaster completed remote builds in B168373: Diff 434906. Jun 7 2022, 12:16 PM

sammccall accepted this revision. Jun 8 2022, 2:34 PM

sammccall added inline comments.

clang-tools-extra/pseudo/include/clang-pseudo/Grammar.h
89	nit: I'd drop this sentence, as using it to index into a table is only ever used to produce debug representations I think - usually we use ExtensionID as an index into a map
215	looking at this again, ExtensionNames seems clearer as ExtensionNames[ExtID] seems more obvious that the kinds agree. But up to you

This revision is now accepted and ready to land. Jun 8 2022, 2:34 PM

hokein marked an inline comment as done. Jun 9 2022, 3:03 AM

hokein added inline comments.

clang-tools-extra/pseudo/include/clang-pseudo/Grammar.h
215	I considered ExtensionNames, but decided to use AttributeValues. `AttributeValues` corresponds to the syntactic grammar `[key=value]`, where we call the value the attribute value, while `ExtensionNames` is less clear. And yeah, `AttributeValues[ExtID]` is somewhat confusing, but I think it is fine, as it is just printing code.

This revision was landed with ongoing or failed builds. Jun 9 2022, 3:08 AM

Closed by commit rGf1ac00c9b0d1: [pseudo] Add grammar annotations support. (authored by hokein).

This revision was automatically updated to reflect the committed changes.

hokein added a commit: rGf1ac00c9b0d1: [pseudo] Add grammar annotations support.

uabelho added a subscriber: uabelho. Jun 9 2022, 4:46 AM

uabelho added inline comments.

clang-tools-extra/pseudo/lib/grammar/GrammarBNF.cpp

234

I get a warning/error on this line with this commit:

13:31:17 ../../clang-tools-extra/pseudo/lib/grammar/GrammarBNF.cpp:235:36: error: missing field 'Attributes' initializer [-Werror,-Wmissing-field-initializers]
13:31:17       Out.Sequence.push_back({Chunk});
13:31:17                                    ^
13:31:17 1 error generated.

I see the warning when compiling with clang 8.0.
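
For readers unfamiliar with this diagnostic, the sketch below reproduces the shape of the problem under stated assumptions. `Element` is a simplified stand-in for the patch's `RuleSpec::Element`, and `buildSequence` is a hypothetical helper. The content of the follow-up fix commit is not shown on this page, so explicitly initializing the new field is presented only as one common way to satisfy `-Wmissing-field-initializers`.

```cpp
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for RuleSpec::Element after the patch added a field.
struct Element {
  std::string Symbol;
  std::vector<std::pair<std::string, std::string>> Attributes;
};

std::vector<Element> buildSequence(const std::vector<std::string> &Chunks) {
  std::vector<Element> Sequence;
  for (const std::string &Chunk : Chunks) {
    // `Sequence.push_back({Chunk})` gives Attributes no explicit
    // initializer, which is what the log above reports. Initializing every
    // field explicitly avoids the diagnostic.
    Sequence.push_back({Chunk, /*Attributes=*/{}});
  }
  return Sequence;
}
```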

hokein added inline comments. Jun 9 2022, 5:13 AM

clang-tools-extra/pseudo/lib/grammar/GrammarBNF.cpp
234	sorry (I don't have this warning enabled), should be fixed in 9ce232fba99c47c3246f06fcbe37c24b9d90585f.

uabelho added inline comments. Jun 9 2022, 5:30 AM

clang-tools-extra/pseudo/lib/grammar/GrammarBNF.cpp
234	Thanks!

Revision Contents

Path	Size
clang-tools-extra/pseudo/include/clang-pseudo/Grammar.h	29 lines
clang-tools-extra/pseudo/lib/grammar/Grammar.cpp	2 lines
clang-tools-extra/pseudo/lib/grammar/GrammarBNF.cpp	71 lines
clang-tools-extra/pseudo/unittests/GrammarTest.cpp	13 lines

Diff 433354

clang-tools-extra/pseudo/include/clang-pseudo/Grammar.h

(13 unchanged lines hidden)
// translation-unit := declaration-seq_opt		// translation-unit := declaration-seq_opt
// declaration-seq := declaration		// declaration-seq := declaration
// declaration-seq := declaration-seq declaration		// declaration-seq := declaration-seq declaration
//		//
// A grammar formally describes a language, and it is constructed by a set of		// A grammar formally describes a language, and it is constructed by a set of
// production rules. A rule is of BNF form (AAA := BBB CCC). A symbol is either		// production rules. A rule is of BNF form (AAA := BBB CCC). A symbol is either
// nonterminal or terminal, identified by a SymbolID.		// nonterminal or terminal, identified by a SymbolID.
//		//
		// The grammar supports extensions, which have the syntax form of
		// [key=value;key=value]. Extensions are associated with a grammar symbol (
		// on the right-hand side of the symbol) or a grammar rule (at the end of the
		// rule body).
		//
		// Extensions provide a way to inject custom code into the general GLR
		// parser. For example, we have a rule in the C++ grammar:
		//
		// contextual-override := IDENTIFIER [guard=Override]
		//
		// This rule is guarded -- the reduction of this rule will be conducted by the
		// GLR parser only if the IDENTIFIER content is `override` (see the detailed
		// implementation in CXX.h).
		//
// Notions about the BNF grammar:		// Notions about the BNF grammar:
// - "_" is the start symbol of the augmented grammar;		// - "_" is the start symbol of the augmented grammar;
// - single-line comment is supported, starting with a #		// - single-line comment is supported, starting with a #
// - A rule describes how a nonterminal (left side of :=) is constructed, and		// - A rule describes how a nonterminal (left side of :=) is constructed, and
// it is per line in the grammar file		// it is per line in the grammar file
// - Terminals (also called tokens) correspond to the clang::TokenKind; they		// - Terminals (also called tokens) correspond to the clang::TokenKind; they
// are written in the grammar like "IDENTIFIER", "USING", "+"		// are written in the grammar like "IDENTIFIER", "USING", "+"
// - Nonterminals are specified with "lower-case" names in the grammar; they		// - Nonterminals are specified with "lower-case" names in the grammar; they
(34 unchanged lines hidden)	inline tok::TokenKind symbolToToken(SymbolID SID) {
SID &= ~TokenFlag;		SID &= ~TokenFlag;
assert(SID < NumTerminals);		assert(SID < NumTerminals);
return static_cast<tok::TokenKind>(SID);		return static_cast<tok::TokenKind>(SID);
}		}
inline SymbolID tokenSymbol(tok::TokenKind TK) {		inline SymbolID tokenSymbol(tok::TokenKind TK) {
return TokenFlag | static_cast<SymbolID>(TK);		return TokenFlag | static_cast<SymbolID>(TK);
}		}

		// An extension uniquely identifies an extension in a grammar.
		// It is the index into a table of extension values.
		// NOTE: value among extensions must be unique even within different keys!
		using ExtensionID = uint16_t;

// A RuleID uniquely identifies a production rule in a grammar.		// A RuleID uniquely identifies a production rule in a grammar.
// It is an index into a table of rules.		// It is an index into a table of rules.
using RuleID = uint16_t;		using RuleID = uint16_t;
// There are maximum 2^12 rules.		// There are maximum 2^12 rules.
static constexpr unsigned RuleBits = 12;		static constexpr unsigned RuleBits = 12;

// Represent a production rule in the grammar, e.g.		// Represent a production rule in the grammar, e.g.
// expression := a b c		// expression := a b c
// ^Target ^Sequence		// ^Target ^Sequence
struct Rule {		struct Rule {
Rule(SymbolID Target, llvm::ArrayRef<SymbolID> Seq);		Rule(SymbolID Target, llvm::ArrayRef<SymbolID> Seq);

// We occupy 4 bits for the sequence, in theory, it can be at most 2^4 tokens		// We occupy 4 bits for the sequence, in theory, it can be at most 2^4 tokens
// long, however, we're stricter in order to reduce the size, we limit the max		// long, however, we're stricter in order to reduce the size, we limit the max
// length to 9 (this is the longest sequence in cxx grammar).		// length to 9 (this is the longest sequence in cxx grammar).
static constexpr unsigned SizeBits = 4;		static constexpr unsigned SizeBits = 4;
static constexpr unsigned MaxElements = 9;		static constexpr unsigned MaxElements = 9;
static_assert(MaxElements <= (1 << SizeBits), "Exceeds the maximum limit");		static_assert(MaxElements <= (1 << SizeBits), "Exceeds the maximum limit");
static_assert(SizeBits + SymbolBits <= 16,		static_assert(SizeBits + SymbolBits <= 16,
"Must be able to store symbol ID + size efficiently");		"Must be able to store symbol ID + size efficiently");

// 16 bits for target symbol and size of sequence:		// 16 bits for target symbol and size of sequence:
// SymbolID : 12 | Size : 4		// SymbolID : 12 | Size : 4
SymbolID Target : SymbolBits;		SymbolID Target : SymbolBits;
uint8_t Size : SizeBits; // Size of the Sequence		uint8_t Size : SizeBits; // Size of the Sequence
SymbolID Sequence[MaxElements];		SymbolID Sequence[MaxElements];

		// A guard extension controls whether a reduction of a rule will be conducted
		// by the GLR parser.
		// 0 is sentinel extension ID, indicating no extensions.
		ExtensionID Guard = 0;

llvm::ArrayRef<SymbolID> seq() const {		llvm::ArrayRef<SymbolID> seq() const {
return llvm::ArrayRef<SymbolID>(Sequence, Size);		return llvm::ArrayRef<SymbolID>(Sequence, Size);
}		}
friend bool operator==(const Rule &L, const Rule &R) {		friend bool operator==(const Rule &L, const Rule &R) {
return L.Target == R.Target && L.seq() == R.seq();		return L.Target == R.Target && L.seq() == R.seq() && L.Guard == R.Guard;
}		}
};		};

struct GrammarTable;		struct GrammarTable;

// Grammar that describes a programming language, e.g. C++. It represents the		// Grammar that describes a programming language, e.g. C++. It represents the
// contents of the specified grammar.		// contents of the specified grammar.
// It is a building block for constructing a table-based parser.		// It is a building block for constructing a table-based parser.
(69 unchanged lines hidden)	struct GrammarTable {
// prerequisite reductions are reached before dependent ones).		// prerequisite reductions are reached before dependent ones).
std::vector<Rule> Rules;		std::vector<Rule> Rules;
// A table of terminals (aka tokens). It corresponds to the clang::Token.		// A table of terminals (aka tokens). It corresponds to the clang::Token.
// clang::tok::TokenKind is the index of the table.		// clang::tok::TokenKind is the index of the table.
llvm::ArrayRef<std::string> Terminals;		llvm::ArrayRef<std::string> Terminals;
// A table of nonterminals, sorted by name.		// A table of nonterminals, sorted by name.
// SymbolID is the index of the table.		// SymbolID is the index of the table.
std::vector<Nonterminal> Nonterminals;		std::vector<Nonterminal> Nonterminals;
		// A table of extensions values, sorted by name.
		// ExtensionID is the index of the table.
		std::vector<std::string> Extensions;
};		};

} // namespace pseudo		} // namespace pseudo
} // namespace clang		} // namespace clang

#endif // CLANG_PSEUDO_GRAMMAR_H		#endif // CLANG_PSEUDO_GRAMMAR_H
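
One consequence of the `operator==` change in this file is that two rules with the same target and sequence but different guards no longer compare equal. The sketch below shows this with a simplified stand-in; `SimpleRule` is a hypothetical type that drops the bit-packing and the fixed-size sequence of the real `Rule`.

```cpp
#include <cstdint>
#include <vector>

using SymbolID = std::uint16_t;
using ExtensionID = std::uint16_t;

// Simplified stand-in for pseudo::Rule: no bit-packing, dynamic sequence.
struct SimpleRule {
  SymbolID Target = 0;
  std::vector<SymbolID> Sequence;
  ExtensionID Guard = 0; // 0 is the sentinel: the rule is unguarded.

  // Equality includes the guard, matching the updated comparison above.
  friend bool operator==(const SimpleRule &L, const SimpleRule &R) {
    return L.Target == R.Target && L.Sequence == R.Sequence &&
           L.Guard == R.Guard;
  }
};
```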

clang-tools-extra/pseudo/lib/grammar/Grammar.cpp

	(55 unchanged lines hidden)

	std::string Grammar::dumpRule(RuleID RID) const {			std::string Grammar::dumpRule(RuleID RID) const {
	std::string Result;			std::string Result;
	llvm::raw_string_ostream OS(Result);			llvm::raw_string_ostream OS(Result);
	const Rule &R = T->Rules[RID];			const Rule &R = T->Rules[RID];
	OS << symbolName(R.Target) << " :=";			OS << symbolName(R.Target) << " :=";
	for (SymbolID SID : R.seq())			for (SymbolID SID : R.seq())
	OS << " " << symbolName(SID);			OS << " " << symbolName(SID);
				if (R.Guard)
				OS << " [guard=" << T->Extensions[R.Guard] << "]";
	return Result;			return Result;
	}			}

	std::string Grammar::dumpRules(SymbolID SID) const {			std::string Grammar::dumpRules(SymbolID SID) const {
	assert(isNonterminal(SID));			assert(isNonterminal(SID));
	std::string Result;			std::string Result;
	const auto &Range = T->Nonterminals[SID].RuleRange;			const auto &Range = T->Nonterminals[SID].RuleRange;
	for (RuleID RID = Range.Start; RID < Range.End; ++RID)			for (RuleID RID = Range.Start; RID < Range.End; ++RID)
	(116 unchanged lines hidden)
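
The effect of the `dumpRule` change can be tried in isolation with the sketch below. It assumes symbol names are stored directly in the rule and uses `std::ostringstream` in place of `llvm::raw_string_ostream`; `NamedRule` and the `AttributeValues` table passed as a parameter are hypothetical simplifications of the grammar table.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

using ExtensionID = std::uint16_t;

// Simplified rule: symbol names are stored directly instead of SymbolIDs.
struct NamedRule {
  std::string Target;
  std::vector<std::string> Sequence;
  ExtensionID Guard = 0; // Index into the attribute-value table; 0 = none.
};

// Renders "target := a b [guard=Value]"; the suffix appears only for
// guarded rules, mirroring the `if (R.Guard)` check in the diff.
std::string dumpRule(const NamedRule &R,
                     const std::vector<std::string> &AttributeValues) {
  std::ostringstream OS;
  OS << R.Target << " :=";
  for (const std::string &Sym : R.Sequence)
    OS << " " << Sym;
  if (R.Guard)
    OS << " [guard=" << AttributeValues.at(R.Guard) << "]";
  return OS.str();
}
```

An unguarded rule prints exactly as before the patch, so existing dump-based tests are unaffected unless a rule carries a guard.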

clang-tools-extra/pseudo/lib/grammar/GrammarBNF.cpp

(38 unchanged lines hidden)	assert(llvm::all_of(Specs,
R.Sequence, [](const RuleSpec::Element &E) {		R.Sequence, [](const RuleSpec::Element &E) {
return !E.Symbol.endswith(OptSuffix);		return !E.Symbol.endswith(OptSuffix);
});		});
}) &&		}) &&
"Optional symbols should be eliminated!");		"Optional symbols should be eliminated!");

auto T = std::make_unique<GrammarTable>();		auto T = std::make_unique<GrammarTable>();

// Assemble the name->ID and ID->nonterminal name maps.		// Assemble the name->ID and ID->nonterminal name maps.
llvm::DenseSet<llvm::StringRef> UniqueNonterminals;		llvm::DenseSet<llvm::StringRef> UniqueNonterminals;
llvm::DenseMap<llvm::StringRef, SymbolID> SymbolIds;		llvm::DenseMap<llvm::StringRef, SymbolID> SymbolIds;

		llvm::DenseSet<llvm::StringRef> UniqueExtensionValues;
		sammccall (Done): unused

		sammccall (Not Done): why ""?
		hokein (author, Done): This was for the sentinel default 0 attribute value.
for (uint16_t I = 0; I < NumTerminals; ++I)		for (uint16_t I = 0; I < NumTerminals; ++I)
SymbolIds.try_emplace(T->Terminals[I], tokenSymbol(tok::TokenKind(I)));		SymbolIds.try_emplace(T->Terminals[I], tokenSymbol(tok::TokenKind(I)));
auto Consider = [&](llvm::StringRef Name) {		auto Consider = [&](llvm::StringRef Name) {
if (!SymbolIds.count(Name))		if (!SymbolIds.count(Name))
UniqueNonterminals.insert(Name);		UniqueNonterminals.insert(Name);
};		};
for (const auto &Spec : Specs) {		for (const auto &Spec : Specs) {
Consider(Spec.Target);		Consider(Spec.Target);
		sammccall (Not Done): AttrValueIDs never seems to get populated, so I'm not sure this works.
		hokein (author, Done): Oops, this part of the code wasn't fully cleaned up. AttrValueIDs is indeed not needed.
for (const RuleSpec::Element &Elt : Spec.Sequence)		for (const RuleSpec::Element &Elt : Spec.Sequence) {
Consider(Elt.Symbol);		Consider(Elt.Symbol);
		for (const auto& KV : Elt.Extensions)
		UniqueExtensionValues.insert(KV.second);
		}
}		}
llvm::for_each(UniqueNonterminals, [&T](llvm::StringRef Name) {		llvm::for_each(UniqueNonterminals, [&T](llvm::StringRef Name) {
T->Nonterminals.emplace_back();		T->Nonterminals.emplace_back();
T->Nonterminals.back().Name = Name.str();		T->Nonterminals.back().Name = Name.str();
});		});
assert(T->Nonterminals.size() < (1 << (SymbolBits - 1)) &&		assert(T->Nonterminals.size() < (1 << (SymbolBits - 1)) &&
"Too many nonterminals to fit in SymbolID bits!");		"Too many nonterminals to fit in SymbolID bits!");
llvm::sort(T->Nonterminals, [](const GrammarTable::Nonterminal &L,		llvm::sort(T->Nonterminals, [](const GrammarTable::Nonterminal &L,
const GrammarTable::Nonterminal &R) {		const GrammarTable::Nonterminal &R) {
return L.Name < R.Name;		return L.Name < R.Name;
});		});
		// Add an empty string for the corresponding sentinel none extension.
		T->Extensions.push_back("");
		llvm::for_each(UniqueExtensionValues, [&T](llvm::StringRef Name) {
		T->Extensions.emplace_back();
		T->Extensions.back() = Name.str();
		});
		llvm::sort(T->Extensions);
		assert(T->Extensions.front() == "");

// Build name -> ID maps for nonterminals.		// Build name -> ID maps for nonterminals.
for (SymbolID SID = 0; SID < T->Nonterminals.size(); ++SID)		for (SymbolID SID = 0; SID < T->Nonterminals.size(); ++SID)
SymbolIds.try_emplace(T->Nonterminals[SID].Name, SID);		SymbolIds.try_emplace(T->Nonterminals[SID].Name, SID);

// Convert the rules.		// Convert the rules.
T->Rules.reserve(Specs.size());		T->Rules.reserve(Specs.size());
std::vector<SymbolID> Symbols;		std::vector<SymbolID> Symbols;
auto Lookup = [SymbolIds](llvm::StringRef Name) {		auto Lookup = [SymbolIds](llvm::StringRef Name) {
auto It = SymbolIds.find(Name);		auto It = SymbolIds.find(Name);
assert(It != SymbolIds.end() && "Didn't find the symbol in SymbolIds!");		assert(It != SymbolIds.end() && "Didn't find the symbol in SymbolIds!");
return It->second;		return It->second;
};		};
for (const auto &Spec : Specs) {		for (const auto &Spec : Specs) {
assert(Spec.Sequence.size() <= Rule::MaxElements);		assert(Spec.Sequence.size() <= Rule::MaxElements);
Symbols.clear();		Symbols.clear();
for (const RuleSpec::Element &Elt : Spec.Sequence)		for (const RuleSpec::Element &Elt : Spec.Sequence)
Symbols.push_back(Lookup(Elt.Symbol));		Symbols.push_back(Lookup(Elt.Symbol));
T->Rules.push_back(Rule(Lookup(Spec.Target), Symbols));		T->Rules.push_back(Rule(Lookup(Spec.Target), Symbols));
}		}
		sammccall (Not Done): This doesn't seem worth diagnosing to me - it seems stylistically weird but not really a problem to put a guard wherever.
		hokein (author, Done):
		> This doesn't seem worth diagnosing to me - it seems stylistically weird but not really a problem to put a guard wherever.
		It is valid at the syntax level but not at the semantic level, which I think is confusing: the BNF parser doesn't emit any warning for this usage and just silently ignores the guard (a grammar writer might think the grammar is correct and that the guard works even when placed in the middle of a rule). On the other hand, as you mentioned, the grammar file is not user-facing and we're the current authors, so it seems OK not to diagnose it right now, in favour of simplicity (we might need to reconsider if there are new grammar contributors).
		sammccall (Not Done): maybe another function to pull out here? applyAttribute(StringRef Name, StringRef Value, AttributeID ValueID, Rule&, Element&)
		hokein (author, Done): Sounds good, moved the extension bits to a dedicated `applyExtension`.
		applyExtension(Specs, *T);
		sammccall (Done): doing this outside the loop above means this function has to handle all specs/rules and work out how they correspond, which is fragile. Instead can we do it inside the loop, and just handle a single attribute at a time?
		    for (const auto &Spec : Specs) {
		      // ... create rule
		      for (const auto &Attribute : Spec.Attributes)
		        applyAttribute(Attribute, T->Rules.back());
		    }

assert(T->Rules.size() < (1 << RuleBits) &&		assert(T->Rules.size() < (1 << RuleBits) &&
"Too many rules to fit in RuleID bits!");		"Too many rules to fit in RuleID bits!");
const auto &SymbolOrder = getTopologicalOrder(T.get());		const auto &SymbolOrder = getTopologicalOrder(T.get());
llvm::stable_sort(		llvm::stable_sort(
T->Rules, [&SymbolOrder](const Rule &Left, const Rule &Right) {		T->Rules, [&SymbolOrder](const Rule &Left, const Rule &Right) {
// Sorted by the topological order of the nonterminal Target.		// Sorted by the topological order of the nonterminal Target.
return SymbolOrder[Left.Target] < SymbolOrder[Right.Target];		return SymbolOrder[Left.Target] < SymbolOrder[Right.Target];
});		});
[61 lines not shown]	public:
}		}

private:		private:
// Text representation of a BNF grammar rule.		// Text representation of a BNF grammar rule.
struct RuleSpec {		struct RuleSpec {
llvm::StringRef Target;		llvm::StringRef Target;
struct Element {		struct Element {
llvm::StringRef Symbol; // Name of the symbol		llvm::StringRef Symbol; // Name of the symbol
		// Extensions that are associated to the sequence symbol or rule.
		sammccall (Done): rather attributes :-(
		std::vector<std::pair<llvm::StringRef /*Key*/, llvm::StringRef /*Value*/>>
		Extensions;
};		};
std::vector<Element> Sequence;		std::vector<Element> Sequence;

std::string toString() const {		std::string toString() const {
std::vector<llvm::StringRef> Body;		std::vector<llvm::StringRef> Body;
for (const auto &E : Sequence)		for (const auto &E : Sequence)
Body.push_back(E.Symbol);		Body.push_back(E.Symbol);
return llvm::formatv("{0} := {1}", Target, llvm::join(Body, " "));		return llvm::formatv("{0} := {1}", Target, llvm::join(Body, " "));
[24 lines not shown]	bool parseLine(llvm::StringRef Line, RuleSpec &Out) {
}		}

Out.Target = Parts.first.trim();		Out.Target = Parts.first.trim();
Out.Sequence.clear();		Out.Sequence.clear();
for (llvm::StringRef Chunk : llvm::split(Parts.second, ' ')) {		for (llvm::StringRef Chunk : llvm::split(Parts.second, ' ')) {
Chunk = Chunk.trim();		Chunk = Chunk.trim();
if (Chunk.empty())		if (Chunk.empty())
continue; // skip empty		continue; // skip empty
		if (Chunk.startswith("[") && Chunk.endswith("]")) {
		sammccall (Not Done): Again, I don't think we should be aiming to provide nice diagnostics for each possible way we could get this wrong - the grammar here is part of the pseudo-parser, not user input.
		hokein (author, Done): Fair enough, but we should keep at least the minimal critical diagnostics to avoid shooting ourselves in the foot.
		if (Out.Sequence.empty())
		continue;
		parseExtension(Chunk, Out.Sequence.back().Extensions);
		continue;
		}

Out.Sequence.push_back({Chunk});		Out.Sequence.push_back({Chunk});
		uabelho (Not Done): I get a warning/error on this line with this commit:
		    13:31:17 ../../clang-tools-extra/pseudo/lib/grammar/GrammarBNF.cpp:235:36: error: missing field 'Attributes' initializer [-Werror,-Wmissing-field-initializers]
		    13:31:17 Out.Sequence.push_back({Chunk});
		    13:31:17 ^
		    13:31:17 1 error generated.
		I see the warning when compiling with clang 8.0.
		hokein (author, Done): Sorry (I don't have this warning enabled), this should be fixed in 9ce232fba99c47c3246f06fcbe37c24b9d90585f.
		uabelho (Not Done): Thanks!
}		}
return true;		return true;
		}

		bool parseExtension(
		llvm::StringRef Content,
		std::vector<std::pair<llvm::StringRef, llvm::StringRef>> &Out) {
		assert(Content.startswith("[") && Content.endswith("]"));
		for (llvm::StringRef ExtText :
		sammccall (Not Done): For simplicity we could also drop this loop. If we ever need to specify two attributes on the same token, `a := b [foo] [bar]` works fine.
		llvm::split(Content.drop_front().drop_back(), ';')) {
		auto KV = ExtText.split('=');
		if (KV.first == ExtText) { // no separator in Line
		sammccall (Not Done): If we drop this line, then [foo] will be equivalent to [foo=], i.e. attribute is present and points at the empty string. This seems reasonable to me, and is a good syntax for boolean attributes (where [foo] denotes set and no attribute denotes unset).
		hokein (author, Done): This syntax for bool-type attributes seems good to me. Note that there is a subtle difference from the string-type attributes, where we use the empty string as the sentinel attribute value (denoting unset) whose ExtensionID is 0. I think it should not be a big issue (we only handle them at BNF parsing time, and propagate the field values into `Rule`).
		Diagnostics.push_back(
		llvm::formatv("Failed to parse extension '{0}': no separator =",
		ExtText)
		.str());
		return false;
		}
		Out.push_back({KV.first, KV.second.trim()});
		}
		return true;
		}
		// Apply the parsed extensions (stored in RuleSpec) to the grammar Rule.
		void applyExtension(llvm::ArrayRef<RuleSpec> Specs, GrammarTable &T) {
		sammccall (Done): (This seems like a condition we can diagnose when applying the attribute, instead of needing to do it eagerly here)
		assert(T.Rules.size() == Specs.size());
		assert(llvm::is_sorted(T.Extensions));
		auto LookupExtensionID = [&T](llvm::StringRef Name) {
		const auto It = llvm::partition_point(
		T.Extensions, [&](llvm::StringRef X) { return X < Name; });
		assert(It != T.Extensions.end() && *It == Name &&
		"Didn't find the symbol in AttrValues!");
		return It - T.Extensions.begin();
};		};
		for (unsigned I = 0; I < Specs.size(); ++I) {
		for (const auto &KV : Specs[I].Sequence.back().Extensions) {
		if (KV.first == "guard") {
		T.Rules[I].Guard = LookupExtensionID(KV.second);
		continue;
		}
		Diagnostics.push_back(
		llvm::formatv("Unknown extension key '{0}'", KV.first).str());
		}
		}
		}

// Inlines all _opt symbols.		// Inlines all _opt symbols.
// For example, a rule E := id +_opt id, after elimination, we have two		// For example, a rule E := id +_opt id, after elimination, we have two
// equivalent rules:		// equivalent rules:
// 1) E := id + id		// 1) E := id + id
// 2) E := id id		// 2) E := id id
std::vector<RuleSpec> eliminateOptional(llvm::ArrayRef<RuleSpec> Input) {		std::vector<RuleSpec> eliminateOptional(llvm::ArrayRef<RuleSpec> Input) {
std::vector<RuleSpec> Results;		std::vector<RuleSpec> Results;
[77 lines not shown]
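
Two suggestions from the review thread above can be sketched together: treating `[foo]` as equivalent to `[foo=]`, and applying one attribute to one rule at a time rather than walking all specs and rules in a separate pass. The sketch below is standalone and illustrative only. `Attribute`, `Rule`, `parseAttribute`, and `applyAttribute` are hypothetical simplified names, the guard is stored as a string rather than an ID, and the whitespace trimming done by the patch is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical simplified types; not the clang-pseudo definitions.
using Attribute = std::pair<std::string, std::string>; // (key, value)

struct Rule {
  std::string Target;
  std::string Guard; // empty means "no guard" in this sketch
};

// Parses a single "[key=value]" chunk. A chunk without '=' yields an
// empty value, so "[foo]" behaves like "[foo=]".
Attribute parseAttribute(const std::string &Chunk) {
  assert(Chunk.size() >= 2 && Chunk.front() == '[' && Chunk.back() == ']');
  const std::string Body = Chunk.substr(1, Chunk.size() - 2);
  const std::size_t Eq = Body.find('=');
  if (Eq == std::string::npos)
    return Attribute(Body, std::string());
  return Attribute(Body.substr(0, Eq), Body.substr(Eq + 1));
}

// Applies one attribute to one rule. Unknown keys are reported through
// Diagnostics and leave the rule untouched.
bool applyAttribute(const Attribute &A, Rule &R,
                    std::vector<std::string> &Diagnostics) {
  if (A.first == "guard") {
    R.Guard = A.second;
    return true;
  }
  Diagnostics.push_back("Unknown extension key '" + A.first + "'");
  return false;
}
```

With this shape, the caller can invoke `applyAttribute` right after creating each rule, so there is no need to assert that the spec and rule vectors stay the same size and order.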

clang-tools-extra/pseudo/unittests/GrammarTest.cpp

[93 lines not shown]	TEST_F(GrammarTest, RuleIDSorted) {
)bnf");		)bnf");
ASSERT_TRUE(Diags.empty());		ASSERT_TRUE(Diags.empty());

EXPECT_LT(ruleFor("z"), ruleFor("y"));		EXPECT_LT(ruleFor("z"), ruleFor("y"));
EXPECT_LT(ruleFor("y"), ruleFor("x"));		EXPECT_LT(ruleFor("y"), ruleFor("x"));
EXPECT_LT(ruleFor("x"), ruleFor("_"));		EXPECT_LT(ruleFor("x"), ruleFor("_"));
}		}

		TEST_F(GrammarTest, Annotation) {
		build(R"bnf(
		_ := IDENTIFIER [guard=override]
		sammccall (Done): if you like you could add 3 more rules, with no annotation, guard=override, guard=somethingelse and verify they're zero, equal, nonequal respectively.
		)bnf");
		ASSERT_TRUE(Diags.empty());
		EXPECT_TRUE(G->lookupRule(ruleFor("_")).Guard);
		}

TEST_F(GrammarTest, Diagnostics) {		TEST_F(GrammarTest, Diagnostics) {
build(R"cpp(		build(R"cpp(
_ := ,_opt		_ := ,_opt
_ := undefined-sym		_ := undefined-sym
null :=		null :=
_ := IDENFIFIE # a typo of the terminal IDENFITIER		_ := IDENFIFIE # a typo of the terminal IDENFITIER

invalid		invalid
# cycle		# cycle
a := b		a := b
b := a		b := a

		_ := IDENTIFIER [guard=override;unknown=value]
)cpp");		)cpp");

EXPECT_EQ(G->underscore(), id("_"));		EXPECT_EQ(G->underscore(), id("_"));
EXPECT_THAT(Diags, UnorderedElementsAre(		EXPECT_THAT(Diags, UnorderedElementsAre(
"Rule '_ := ,_opt' has a nullable RHS",		"Rule '_ := ,_opt' has a nullable RHS",
"Rule 'null := ' has a nullable RHS",		"Rule 'null := ' has a nullable RHS",
"No rules for nonterminal: undefined-sym",		"No rules for nonterminal: undefined-sym",
"Failed to parse 'invalid': no separator :=",		"Failed to parse 'invalid': no separator :=",
"Token-like name IDENFIFIE is used as a nonterminal",		"Token-like name IDENFIFIE is used as a nonterminal",
"No rules for nonterminal: IDENFIFIE",		"No rules for nonterminal: IDENFIFIE",
"The grammar contains a cycle involving symbol a"));		"The grammar contains a cycle involving symbol a",
		"Unknown extension key 'unknown'"));
}		}

TEST_F(GrammarTest, FirstAndFollowSets) {		TEST_F(GrammarTest, FirstAndFollowSets) {
build(		build(
R"bnf(		R"bnf(
_ := expr		_ := expr
expr := expr - term		expr := expr - term
expr := term		expr := term
[57 lines not shown]
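
The extra checks suggested for the `Annotation` test (an unannotated rule has guard ID zero, equal guard names share an ID, different names get different IDs) follow from how the patch builds the extension table: values are sorted with an empty-string sentinel in front, and IDs are positions found by binary search. The standalone sketch below illustrates that scheme with the standard library only. `buildExtensionTable` and `lookupExtensionID` are hypothetical helper names, `std::lower_bound` stands in for `llvm::partition_point`, and the explicit deduplication step is an addition of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Builds a sorted, deduplicated table of extension values. The empty
// string is always present and sorts first, so it gets ID 0 (the
// sentinel for "no extension").
std::vector<std::string> buildExtensionTable(std::vector<std::string> Values) {
  Values.push_back(std::string());
  std::sort(Values.begin(), Values.end());
  Values.erase(std::unique(Values.begin(), Values.end()), Values.end());
  assert(Values.front().empty() && "sentinel must be at index 0");
  return Values;
}

// Returns the ID (index) of Name in a table built by buildExtensionTable.
// Name must be present in the table.
std::size_t lookupExtensionID(const std::vector<std::string> &Table,
                              const std::string &Name) {
  const auto It = std::lower_bound(Table.begin(), Table.end(), Name);
  assert(It != Table.end() && *It == Name && "value missing from table");
  return static_cast<std::size_t>(It - Table.begin());
}
```

One consequence, noted in the review thread, is that an attribute with an empty value (such as `[foo]`) maps to the same ID as the sentinel, so the ID alone cannot distinguish "set to empty" from "unset".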

Revision Contents

Diff 433354
