This is an archive of the discontinued LLVM Phabricator instance.

Initial support for letting plugins perform custom parsing of attribute arguments.
Needs ReviewPublic

Authored by jonathan.protzenko on Aug 18 2020, 1:28 PM.

Details

Summary

This is a preliminary patch that I hope can serve as a basis for discussion. The
design summary below is copy/pasted from https://bugs.llvm.org/show_bug.cgi?id=46446#c7

(I'm intentionally posting a work-in-progress so that I can make sure that things
are roughly looking good; I'd like to defer details (clang-format'ing my
changes, naming conventions, tests) to later once the design is settled.)

The patch essentially adds the ability for a plugin to handle the parsing of a
CXX11 attribute's arguments. This means that with this patch, plugin-defined
CXX11 attributes are no longer limited to constant attributes that take no
arguments.

The patch intentionally gives the plugin a fair amount of freedom, to enable
interesting use-cases (see below).

The patch does not (yet) tackle the ability for out-of-tree plugins to extend
Attr.td. I'd be happy to work on that once the present patch makes progress.

Design

I chose to extend the ParsedAttrInfo class. This allows reusing the logic that
searches for an attribute plugin that has registered a matching spelling, and
reusing the existing registry of plugins. (I contemplated a separate registry,
but thought it would duplicate the spelling-search logic and would also lead to
code duplication in the plugin itself.)

By default, ParsedAttrInfo's (most of which do *not* come from plugins) are
happy to let the built-in clang parsing logic take effect, enforcing tweaking
knobs such as NumArgs and OptArgs. Plugins can, if they are so inclined, elect
to bypass clang's parsing logic (which, as we saw on Bugzilla, ignores CXX11
attribute arguments when they're not known to clang). In that case, plugins are
on their own; they are expected to handle all of the parsing, as well as pushing
the resulting ParsedAttr into the ParsedAttributes. They then report whether the
arguments are syntactically valid or not and the rest of the parsing code
handles error reporting.

(Side-note: this means that it's a three-state return value, and I'm currently
reusing the existing AttrHandling enum, but it could be improved with either
better names or a separate enum.)

The plugin receives, in addition to location information, two key parameters:

  • the parser, obviously
  • a possibly-null Declarator representing the partially parsed declaration that the attribute is attached to, if any.

The second parameter may be the most surprising part of this patch, so let me
try to provide some motivation and context.

My goal is to enable plugins to perform advanced static analysis and code
generation, relying on rich CXX11 attributes capturing deep semantic information
relating, say, a function's parameters, or a struct's fields.

You can easily imagine someone authoring a static analyzer that recognizes:

typedef struct
{
  size_t len;
  uint8_t *chars;
    [[ my_analyzer::invariant(strlen(chars) == len) ]];
}
managed_string;

The point here being to leverage a nice CXX11 attribute that benefits from
typo-checking, proper name and macro resolution, etc. rather than using some
unreliable syntax embedded in comments like so many other tools are doing, or
worse, a custom C++ parser.

In our case, we're auto-generating parsers and serializers from a struct type
definition (see https://www.usenix.org/system/files/sec19-ramananandro_0.pdf for
more info), so I'd like to author:

typedef struct {
  uint32_t min_version;
  uint32_t max_version
    [[ everparse::constraint (max_version >= min_version && min_version >= 2 && max_version <= 4) ]];
} my_network_format;
`

Of course, our actual network formats are much more complicated, but this is
just for the sake of example. My point is that allowing plugins a substantial
degree of freedom is likely to enable a flurry of novel and exciting use-cases
based on clang.

Back to the design of the patch, receiving the Declarator allows me to do things
like this (in the plugin):

Sema &S = P->getActions();
TypeSourceInfo *T = S.GetTypeForDeclarator(*D, P->getCurScope());
QualType R = T->getType();
VarDecl *VD = VarDecl::Create(S.getASTContext(), S.CurContext,
    D->getBeginLoc(), D->getIdentifierLoc(), D->getIdentifier(), R,
    T,
    SC_None);
// Doing this makes sure Sema is aware of the new scope entry, meaning this name
// will be in scope when parsing the expression. (Parsing and scope
// resolution are intertwined.)
VD->setImplicit(true);
S.PushOnScopeChains(VD, P->getCurScope());

This allows my plugin to successfully parse and recognize complex arguments that
refer to the current declarator, in effect letting my plugin define its own
extended syntax and scoping rules.

I hope this helps, I'll keep working on the patch (this is the first version
that seems to do something useful) but thought it'd be helpful to get the
discussion started in the meanwhile.

Details (optional reading)

  • I was able to refactor the lookup-by-spelling logic in ParsedAttrInfo::get -- I'm curious to know if the two overloaded methods are conforming to clang's style
  • I had to switch ParsedAttrInfo from a struct to a class because of a warning related to Microsoft ABI standards violations
  • I'm wondering if the Declarator that the plugin receives should instead be a more general class (with a Kind()) that covers the union of all the "things currently being parsed that an attribute may be attached to". But, since I couldn't come up with other compelling examples, I've left it as-is until I find a strong argument in favor of introducing extra complexity.

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · View Herald TranscriptAug 18 2020, 1:28 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript
jonathan.protzenko requested review of this revision.Aug 18 2020, 1:28 PM

Thank you for the work in progress patch as a starting point for discussion! I'm adding a few more reviewers who may be interested in this work.

One part of the design I'm not yet certain of is passing in the Declarator * to the various parsing functions. While I understand the idea (and appreciate the explanation of why you'd like this), my concern is that attributes appertain to more than just declarators (like declaration specifiers, types, and statements). Do you envision supporting all the various entities that an attribute can appertain to? Also, do you intend to provide the same information to the other spellings, like __attribute__ or __declspec (because plugins can define custom attributes with those spellings as well)?

clang/include/clang/Sema/ParsedAttr.h
97

I think this should also have a parameter for the syntax used for the attribute, because different syntaxes may have different rules for how to parse the arguments to the attribute (this is one of the reasons why we have common attribute argument parsing as well as syntax-specific attribute argument parsing in Parser).

Thanks for the review!

Regarding Declarator: I briefly mentioned something related in the details section... actually maybe you can tell me what's the best approach here. I was thinking of just passing in a void* and expecting plugins to do

if (const Declarator *D = isa_and_nonnull<Declarator>(Context))

This would allow a future-proof API where we can, in the future, pass in more classes for the "Context" argument (not just a Declarator), relying on clients doing dynamic casts. If I understood https://llvm.org/docs/ProgrammersManual.html#isa this void* cast would work?

My only hesitation is I'm not sure about the right technical device to use for this polymorphism pattern, as I've seen a couple related things in the clang code base:

  • I've seen a pattern where there's a wrapper class that has a Kind() method and then allows downcasting based on the results of the Kind (seems overkill here)
  • I've also seen references to opaque_ptr somewhere with a get<T> method... (maybe these are helpers to do what I'm doing better?)

If you can confirm the void* is the right way to go (or point me to a suitable pattern I can take inspiration from), I'll submit a revised patch, along with the extra spelling argument (good point, thanks!).

I also seem to have broken a bunch of tests, although I'm not sure why. If you see an obvious cause, I'm happy to get some pointers, otherwise, I'll just dig in an debug as I extend the test suite once the patch stabilizes.

Thanks for the review!

Regarding Declarator: I briefly mentioned something related in the details section... actually maybe you can tell me what's the best approach here. I was thinking of just passing in a void* and expecting plugins to do

if (const Declarator *D = isa_and_nonnull<Declarator>(Context))

This would allow a future-proof API where we can, in the future, pass in more classes for the "Context" argument (not just a Declarator), relying on clients doing dynamic casts. If I understood https://llvm.org/docs/ProgrammersManual.html#isa this void* cast would work?

I don't think isa<> would work well with a void * because the LLVM casting infrastructure works by eventually trying to call a static function on a class, and that static function is given a pointer value which it often calls a member function on. If the types are all correct, then things should work out okay, but I'd not want to design a plugin feature around that unsafe of an interface. In fact, I think it would require the user to use C-style casts because you cannot form a reference to a void *.

My only hesitation is I'm not sure about the right technical device to use for this polymorphism pattern, as I've seen a couple related things in the clang code base:

  • I've seen a pattern where there's a wrapper class that has a Kind() method and then allows downcasting based on the results of the Kind (seems overkill here)
  • I've also seen references to opaque_ptr somewhere with a get<T> method... (maybe these are helpers to do what I'm doing better?)

If you can confirm the void* is the right way to go (or point me to a suitable pattern I can take inspiration from), I'll submit a revised patch, along with the extra spelling argument (good point, thanks!).

I don't think void * is the right way to go and I'm hoping we can devise a way to not require the entity being attributed to be passed when parsing the attribute. For instance, some attributes are parsed before you know what you're attaching them to: [[foobar]] <whatever else>; could be a statement attribute or a declaration attribute, depending on what <whatever else> is. Another example is: int f [[foobar]] <whatever else>; where the attribute could be appertaining to a function (if <whatever else> was ()) or a variable (if <whatever else> was = 12). My intuition is that we're going to need to wait until we know what AST node we're dealing with before asking a plugin to decide whether an attribute appertains to it or not. WDYT?

I'm hoping we can devise a way to not require the entity being attributed to be passed when parsing the attribute

That sounds nice. My only concern with this proposal is to make sure this doesn't rule out the interesting use-cases I was mentioning, like the managed_string example I presented in the motivation section. These use-cases would need to have some contextual information to perform meaningful parsing. Note that I'm not particularly committed to struct field declarations, one could imagine interesting use-cases with function arguments, such as:

void f (int *arr [[ alloc_size(arr) == 8 ]], size_t idx [[ idx <= 4 ]])

But I agree about the parsing problems you mention, and being unable to meaningfully pass a Declarator (or anything else) based on incomplete parsing information. Here's a solution that might makes us both happy :), but you'd have to tell me if this is workable.

  • Upon encountering [[ attr(X) ]] where attr is a plugin-handled attribute: clang parses X as token soup, only caring about finding a matching parenthesis
  • The complete construct (declaration, type, statement, or whatever the attribute appertains to) is parsed.
  • Plugin's diagAppertainsToDecl gets called (as before). Plugin is free to store in its internal state whatever information from the Decl is relevant.
  • Plugin's parseAttributePayload gets called in a state that consumes the token soup (not sure if that's feasible?). The parseAttributePayload method does not receive a Declarator or anything else. Plugins can rely on the fact that if diagAppertainsToDecl returns true, parseAttributePayload will be called next, and refer to its internal state to fetch whatever information from the Decl it needs.

Does this stand any chance of working?

Bonus question: seeing diagAppertainsToDecl, the name of this function led me to believe that for now, plugins could only handle attributes attached to declarations. Is this the case?

I'm hoping we can devise a way to not require the entity being attributed to be passed when parsing the attribute

That sounds nice. My only concern with this proposal is to make sure this doesn't rule out the interesting use-cases I was mentioning, like the managed_string example I presented in the motivation section.

That should be possible.

These use-cases would need to have some contextual information to perform meaningful parsing. Note that I'm not particularly committed to struct field declarations, one could imagine interesting use-cases with function arguments, such as:

void f (int *arr [[ alloc_size(arr) == 8 ]], size_t idx [[ idx <= 4 ]])

But I agree about the parsing problems you mention, and being unable to meaningfully pass a Declarator (or anything else) based on incomplete parsing information. Here's a solution that might makes us both happy :), but you'd have to tell me if this is workable.

  • Upon encountering [[ attr(X) ]] where attr is a plugin-handled attribute: clang parses X as token soup, only caring about finding a matching parenthesis

Just to be clear, what you mean is to lex all of the tokens comprising X and store them off for later?

  • The complete construct (declaration, type, statement, or whatever the attribute appertains to) is parsed.
  • Plugin's diagAppertainsToDecl gets called (as before). Plugin is free to store in its internal state whatever information from the Decl is relevant.
  • Plugin's parseAttributePayload gets called in a state that consumes the token soup (not sure if that's feasible?). The parseAttributePayload method does not receive a Declarator or anything else. Plugins can rely on the fact that if diagAppertainsToDecl returns true, parseAttributePayload will be called next, and refer to its internal state to fetch whatever information from the Decl it needs.

Does this stand any chance of working?

This is along the same lines of what I was thinking, but there is one potential snag. Clang layers our library components and Sema should never refer back to the Parser (it doesn't link the library in). Because you have an odd requirement where you need to do further parsing during the semantics stage, you may run into awkwardness from this. Also, you may have to worry about needing this functionality recursively. e.g., int x [[your_attr(int y [[your_attr(int z;)]];)]]; which could be... interesting.

Bonus question: seeing diagAppertainsToDecl, the name of this function led me to believe that for now, plugins could only handle attributes attached to declarations. Is this the case?

Correct, the plugin system does not yet handle type or statement attributes (it was focusing on declaration attributes first, which are the lion's share of attributes in Clang).

  • Upon encountering [[ attr(X) ]] where attr is a plugin-handled attribute: clang parses X as token soup, only caring about finding a matching parenthesis

Just to be clear, what you mean is to lex all of the tokens comprising X and store them off for later?

Correct.

Does this stand any chance of working?

This is along the same lines of what I was thinking, but there is one potential snag. Clang layers our library components and Sema should never refer back to the Parser (it doesn't link the library in). Because you have an odd requirement where you need to do further parsing during the semantics stage, you may run into awkwardness from this. Also, you may have to worry about needing this functionality recursively. e.g., int x [[your_attr(int y [[your_attr(int z;)]];)]]; which could be... interesting.

I guess I'd have to i) call the action to create, say, a Decl, then later ii) call the plugin and extend the Decl with fresh attributes, assuming the attribute vector on the Decl can be grown *after* the Decl is created -- hopefully this can be driven from the parser, without a back-edge in the dependency graph? Or maybe I'm missing something?

Do you have any example somewhere of how to store tokens and re-fill the lex buffer with them later on? This is much more involved than my current patch so may take me a while to figure out.

Bonus question: seeing diagAppertainsToDecl, the name of this function led me to believe that for now, plugins could only handle attributes attached to declarations. Is this the case?

Correct, the plugin system does not yet handle type or statement attributes (it was focusing on declaration attributes first, which are the lion's share of attributes in Clang).

Ok, I was asking because if there's no plans to handle type or statement attributes via a plugin, then we're somewhat over-engineering this.

  • Upon encountering [[ attr(X) ]] where attr is a plugin-handled attribute: clang parses X as token soup, only caring about finding a matching parenthesis

Just to be clear, what you mean is to lex all of the tokens comprising X and store them off for later?

Correct.

Does this stand any chance of working?

This is along the same lines of what I was thinking, but there is one potential snag. Clang layers our library components and Sema should never refer back to the Parser (it doesn't link the library in). Because you have an odd requirement where you need to do further parsing during the semantics stage, you may run into awkwardness from this. Also, you may have to worry about needing this functionality recursively. e.g., int x [[your_attr(int y [[your_attr(int z;)]];)]]; which could be... interesting.

I guess I'd have to i) call the action to create, say, a Decl, then later ii) call the plugin and extend the Decl with fresh attributes, assuming the attribute vector on the Decl can be grown *after* the Decl is created -- hopefully this can be driven from the parser, without a back-edge in the dependency graph? Or maybe I'm missing something?

I can't think of another case where we've needed to parse from within Sema, so I'm not certain what issues you will or won't run into. The concern I have is that the parser calls into sema for semantic checking, so this causes a dependency cycle that the components may not handle well (or they may handle it just fine). e.g., int x [[whatever(int y;)]]; the parser will parse the int x [[...]] bit, then call into Sema to check the attribute, which would then spawn a parser to parse the int y; bit, which calls into Sema again for semantic handling. It's that last step that may be an issue (for instance, it may introduce a new AST node for int y; along with the semantic checking of the declaration).

Do you have any example somewhere of how to store tokens and re-fill the lex buffer with them later on? This is much more involved than my current patch so may take me a while to figure out.

We don't really have examples of it quite in this way, but we do have a TokenLexer object that you would likely need to use. I think you'd squirrel away the tokens when parsing the attribute argument clause, and then make a new TokenLexer that is used by the parser's Preprocessor member for parsing the tokens. This gets used when expanding tokens from _Pragma, for instance.

Bonus question: seeing diagAppertainsToDecl, the name of this function led me to believe that for now, plugins could only handle attributes attached to declarations. Is this the case?

Correct, the plugin system does not yet handle type or statement attributes (it was focusing on declaration attributes first, which are the lion's share of attributes in Clang).

Ok, I was asking because if there's no plans to handle type or statement attributes via a plugin, then we're somewhat over-engineering this.

There are plans to support those, but I'm not certain if anyone is actively working on adding that support. I want to avoid adding anything that would make supporting statement and type attributes harder, and I'd be worried that passing in the Declarator * would make that project harder to get started on.

I'm responding here after a ping on Bugzilla. The short version is I've stopped working on this patch. While I like the revised design that involves storing a list of tokens to make plugins even more powerful, this turned out to be a much bigger implementation effort than I initially anticipated. I was unable to find help or additional resources for this work at Microsoft; the internal project that might have benefited from this work turned out to not need it *that* bad after all. So, regrettably, that meant there was little incentive left for me to pursue this patch.

I'll keep monitoring this as well as Bugzilla for discussions, and I'll provide information/context on design in whichever way that I can. Thanks again Aaron for all your precious help throughout these discussions -- without your guidance I probably wouldn't have gotten anywhere at all.