This is an archive of the discontinued LLVM Phabricator instance.

[pseudo] Add grammar annotations support.
ClosedPublic

Authored by hokein on May 27 2022, 6:28 AM.

Details

Summary
  • Add annotation handling ([key=value]) in the BNF grammar parser;
  • Define and setup the API in the grammar for attributes;
  • Implement a builtin guard for two simple c++ contexual-override/final use cases;

Diff Detail

Event Timeline

hokein created this revision.May 27 2022, 6:28 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 27 2022, 6:28 AM
Herald added subscribers: mgrang, mgorny. · View Herald Transcript
hokein requested review of this revision.May 27 2022, 6:28 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 27 2022, 6:28 AM

This patch might cover too many things (at least we could split the guide/glr implementation bit), but want to give you an overview of the picture first.

hokein added inline comments.May 27 2022, 6:40 AM
clang-tools-extra/pseudo/include/clang-pseudo/Grammar.h
90

new names are welcome. Attribute is the name I came up with (I think it is clearer than the original Hook),

97

I'm not quite happy with using the value as the ID, I think we can encode the Key into the ID as well (ID := Key | Value).

Similar to the generated enum name, currently we just use the name of Value (Override), it will be more confusing when we add more keys/values, one idea is to add key as well (GuardOverride etc?).

Nice!

This has tests for the parsing-the-attribute bits, but I think we're missing tests for the actual guards added.

clang-tools-extra/pseudo/include/clang-pseudo/GLR.h
114 ↗(On Diff #432538)

this signature seems a little off to me.

Guard logic is identified by a guard ID and we look it up, but we're not passing in the guard ID for some reason. Instead we pass in the rule ID, which this function uses to look up the guard ID again. Why not just pass the guard ID?

That said, it's a bit surprising that we have the rules declare separate guard rules, but then they're all implemented by one function. A map of functions seems more natural. (but not a performance advantage I guess)

Naming: it's confusing that this umbrella function is called "guard", but a specific rule like "guard=Override" is also called "guard". I'd either call this a GuardTester or bypass the issue by making this a DenseMap whose values are Guards.

Altogether I might write this as:

using Guard = llvm::function_ref<bool(llvm::ArrayRef<const ForestNode *> RHS, const TokenStream &, const Grammar &)>;
...
const DenseMap<AttributeID, Guard> &Guards;
129 ↗(On Diff #432538)

either a reference, or a lightweight reference type like function_ref, for consistency with other fields?

clang-tools-extra/pseudo/include/clang-pseudo/Grammar.h
22

nit: first "guard" should be "key"?

90

I don't understand why we need AttributeKey, and to parse things into it, rather than just using StringRef.

(Also here you've put it in the header, but it's not used in any interfaces)

97

Packing all the possible values into a single unstructured enum is weird, but the problem (if it's a problem) is _redundancy_ and packing the attribute name in there as well just makes that worse.

If we want to fix this, I think a sounder fix is something like:

// Grammar.h
using AttributeEnum = uint8_t;
struct Rule { ...; AttributeEnum Guard = 0; }
// CXX.h
enum class GuardID : AttributeEnum { Override = 1; };
enum class RecoveryID : AttributeEnum { Parens = 1; };

i.e. we keep track of the unique values per attribute name and emit an enum for each attribute. This probably means emitting the actual enum definitions in tablegen, rather than a generic .inc file.

(But I'm also fine with not fixing it at this point, and keeping one flat enum for the values)

98

I agree HookID is an unfortunately vague name, but I did consider and reject AttributeID which I think is actively confusing.
In the plain-english meaning, "guard" is an attribute, and "override" is its value. So assigning "override" an "attribute ID" is very confusing.

The difficulty of naming this is that we can't refer to the semantics (e.g. "Guard") as we want a single namespace for all of them, and all they have in common is syntax. However referring to "attribute values" makes for confusing internal data structures, as we don't model the whole attribute generically.

So we're left with names that suggest "this is some kind of anchor to attach custom code to". Maybe ExtensionID is a little clearer than HookID?

100

If we're going to use implicit bool conversions, we should not pretend this is some arbitrary value and drop the constant I think.

105

if this optional and rarely set, I think it's should be a constructor parameter - this constructor could become unwieldy.

It also forces you to process the attributes before creating the Rule, and reading the code doing it afterwards seems more natural.

clang-tools-extra/pseudo/include/clang-pseudo/cxx/CXX.h
39 ↗(On Diff #432538)

this pattern is fragile. (and we're going to end up adding rules, too)

I'd suggest either:

  • generate a separate file for each enum
  • generate a file containing the full definiton of all the enums (we're not using the generality)
  • add the fallback #defines to the .inc file so the caller doesn't have to provide them all
63 ↗(On Diff #432538)

"guard" looks like a verb here. If we're going to use it as both noun and verb, the relationship between the two needs to match the metaphor (in this case, it needs to be the guard who is doing the guarding) and it's not clear that it is.

I think safest to stick to const Guard &getGuard() or so.

clang-tools-extra/pseudo/lib/grammar/GrammarBNF.cpp
47

This patch takes this function from 80->120 lines, and it's starting to feel unwieldy.

I think moving building the UniqueNonterminals/SymbolIDs + the new maps into a function (which returns a struct of SymbolIds and AttrValueIDs) would make this more manageable.

51

unused

52

why ""?

60

AttrValueIDs never seems to get populated, so I'm not sure this works.

104

This doesn't seem worth diagnosing to me - it seems stylistically weird but not really a problem to put a guard wherever.

104

maybe another function to pull out here?
applyAttribute(StringRef Name, StringRef Value, AttributeID ValueID, Rule&, Element&)

227

Again, I don't think we should be aiming to provide nice diagnostics for each possible way we could get this wrong - the grammar here is part of the pseudo-parser, not user input.

259

(This seems like a condition we can diagnose when applying the attribute, instead of needing to do it eagerly here)

clang-tools-extra/pseudo/unittests/GLRTest.cpp
161 ↗(On Diff #432538)

These empty streams aren't valid (in multiple ways: they're not finalized, and they don't contain the tokens that the nodes refer to).

Maybe add a tokenStreamForTest(StringLiteral) function somewhere? Safe because guaranteed to be null-terminated and have sufficient lifetime...

hokein updated this revision to Diff 433351.EditedJun 1 2022, 4:29 AM
hokein marked 3 inline comments as done and an inline comment as not done.

address the grammar part comments:

  • Rename to ExtensionID;
  • Simplify the BNF parsing bit;
hokein added a comment.Jun 1 2022, 4:38 AM

I addressed comments on the grammar part, (the remaining GLR, pseudo_gen parts are not covered yet), I think it would be better them into different patches, so that we can land the grammar bit first, then start doing the error recovery and guard implementation in parallel.

clang-tools-extra/pseudo/include/clang-pseudo/Grammar.h
105

I suppose you mean we should drop the Guard parameter in the constructor, yeah, that sounds good to me, and simplifies the BNF parsing bit.

clang-tools-extra/pseudo/lib/grammar/GrammarBNF.cpp
52

This was for the sentinel default 0 attribute value.

60

oops, this part of code wasn't fully cleaned up. AttrValueIDs is not needed indeed.

104

sounds good, move the extension bits to a dedicate applyExtension.

104

This doesn't seem worth diagnosing to me - it seems stylistically weird but not really a problem to put a guard wherever.

It is valid in syntax level, but not in semantic level, I think this is confusing -- the bnf parser doesn't emit any warning on this usage, just silently ignore them (grammar writer might think the grammar is correct, guard should work even putting it in the middle of the rules).

On the other hand, as you mentioned, grammar file is not user-faced, and we're the current authors, it seems ok to not do it (in favour of simplicity) right now (we might need to reconsider improving it there are new grammar contributors).

227

fair enough, but we should keep minimal critical diagnostics at least to avoid shooting ourself in the foot.

hokein updated this revision to Diff 433354.Jun 1 2022, 4:39 AM
hokein marked 2 inline comments as done.

remove a left-out change.

Thanks, this looks better.
Sorry about confusion with the names - I think annotation is a great name, and we should only use "extension" for the narrower "opaque thing that an annotation value refers to".
Happy to chat about this offline if you like

clang-tools-extra/pseudo/include/clang-pseudo/Grammar.h
22

Hmm, I think we misunderstood each other...

I think annotations is a good name for the syntactic [foo=bar] bit. I think the comment here is now harder to follow, and would prefer the old version.

It's specifically "bar" which indicates some identifier whose semantics will be provided externally, so "bar" is a reference to an extension.

If you want to mention the concept of an extension, then maybe at the end...
Each unique annotation value creates an extension point. The Override guard is implemented in CXX.cpp by binding the ExtensionID for Override to a C++ function that performs the check.

88

"an extension uniquely identifies an extension" is a tautology.

An extension is a piece of native code specific to a grammar that modifies the behavior of annotated rules.
One ExtensionID is assigned for each unique attribute value (all attributes share a namespace).
90

I don't think this last sentence is useful as-is, it looks important but it's not clear who it constrains.
Also "attribute" rather than "key", an attribute is a thing that objects have, a key is a thing maps have.

217

attribute values, or extension names

(attributes are syntactic and have string values, extensions are semantic and can be referred to by names).

clang-tools-extra/pseudo/lib/grammar/GrammarBNF.cpp
106

doing this outside the loop above means this function has to handle all specs/rules and work out how they correspond, which is fragile.

Instead can we do it inside the loop, and just handle a single attribute at a time?

for (const auto &Spec : Specs) {
  // ... create rule
  for (const auto &Attribute : Spec.Attributes)
    applyAttribute(Attribute, T->Rules.back());
}
184

rather attributes :-(

244

For simplicity we could also drop this loop. If we ever need to specify two attributes on the same token, a := b [foo] [bar] works fine

247

If we drop this line, then [foo] will be equivalent to [foo=], i.e. attribute is present and points at the empty string. This seems reasonable to me, and is a good syntax for boolean attributes (where [foo] denotes set and no attribute denotes unset)

clang-tools-extra/pseudo/unittests/GrammarTest.cpp
104

if you like you could add 3 more rules, with no annotation, guard=override, guard=somethingelse
and verify they're zero, equal, nonequal respectively

hokein updated this revision to Diff 434906.Jun 7 2022, 11:55 AM
hokein marked 5 inline comments as done.
  • bring back the attribute concept, narrow down the ExtensionID scope (only used for semantic);
  • loose and simplify the BNF annotations parsing; ([] only allows single attribute, attribute without value are acceptable);
  • address other comments;
hokein added inline comments.Jun 7 2022, 11:58 AM
clang-tools-extra/pseudo/include/clang-pseudo/Grammar.h
22

oops, I misunderstood your "extensionID" comment completely (sorry). Bring it back now.

217

renamed to attribute values.

clang-tools-extra/pseudo/lib/grammar/GrammarBNF.cpp
247

This syntax for bool-type attributes seems good to me.
Note that there is a subtle difference with the string-type attributes, where we use the empty string as the sentinel attribute value (denote unset) whose ExtensionID is 0. I think it should not be a big issue (we only handle them is in the bnf parsing time, and propagate field values in Rule).

sammccall accepted this revision.Jun 8 2022, 2:34 PM
sammccall added inline comments.
clang-tools-extra/pseudo/include/clang-pseudo/Grammar.h
91

nit: I'd drop this sentence, as using it to index into a table is only ever used to produce debug representations I think - usually we use ExtensionID as an index into a map

219

looking at this again, ExtensionNames seems clearer as ExtensionNames[ExtID] seems more obvious that the kinds agree. But up to you

This revision is now accepted and ready to land.Jun 8 2022, 2:34 PM
hokein marked an inline comment as done.Jun 9 2022, 3:03 AM
hokein added inline comments.
clang-tools-extra/pseudo/include/clang-pseudo/Grammar.h
219

I considered extensionNames, but decided to use AttributeValues. AttributeValues corresponds to the syntactic grammar [key=value] where we call value as the attribute value, while using ExtensionNames is less clearer. And yeah AttributeValues[ExtID] is somehow confusing, but I think it is fine, as it is just print code.

This revision was landed with ongoing or failed builds.Jun 9 2022, 3:08 AM
This revision was automatically updated to reflect the committed changes.
uabelho added a subscriber: uabelho.Jun 9 2022, 4:46 AM
uabelho added inline comments.
clang-tools-extra/pseudo/lib/grammar/GrammarBNF.cpp
235

I get a warning/error on this line with this commit:

13:31:17 ../../clang-tools-extra/pseudo/lib/grammar/GrammarBNF.cpp:235:36: error: missing field 'Attributes' initializer [-Werror,-Wmissing-field-initializers]
13:31:17       Out.Sequence.push_back({Chunk});
13:31:17                                    ^
13:31:17 1 error generated.

I see the warning when compiling with clang 8.0.

hokein added inline comments.Jun 9 2022, 5:13 AM
clang-tools-extra/pseudo/lib/grammar/GrammarBNF.cpp
235

sorry (I don't this warning enabled), should be fixed in 9ce232fba99c47c3246f06fcbe37c24b9d90585f.

uabelho added inline comments.Jun 9 2022, 5:30 AM
clang-tools-extra/pseudo/lib/grammar/GrammarBNF.cpp
235

Thanks!