This is an archive of the discontinued LLVM Phabricator instance.

[pseudo] Implement a guard to determine function declarator.
ClosedPublic

Authored by hokein on Jul 6 2022, 12:34 PM.

Details

Summary

This eliminates some simple-declaration/function-definition false parses.

  • implement a function to determine whether a declarator ForestNode is a function declarator;
  • extend the standard declarator to two guarded function-declarator and non-function-declarator nonterminals;

Diff Detail

Event Timeline

hokein created this revision.Jul 6 2022, 12:34 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 6 2022, 12:34 PM
hokein requested review of this revision.Jul 6 2022, 12:34 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 6 2022, 12:34 PM
sammccall added inline comments.Jul 7 2022, 7:03 AM
clang-tools-extra/pseudo/gen/Main.cpp
60 ↗(On Diff #442666)

This code is copied right?

62 ↗(On Diff #442666)

this deserves a comment: "Compared to in the grammar, ; becomes semi and goto becomes kw_goto"?

113 ↗(On Diff #442666)

pull out a ruleName() too?

or mangleSymbol()/mangleRule() by analogy with linker name mangling

clang-tools-extra/pseudo/include/clang-pseudo/Language.h
29 ↗(On Diff #442666)

This allows/encourages guards that dynamically consider multiple rules, which increases the amount of coupling between rules and grammar.

Can we express this as
function-declarator := declarator [guard=FunctionDeclarator]?

This does mangle the grammar a bit to our implementation, and introduces two names for the same concept (guard & nonterminal).
If this feels ugly and redundant, another option would be to change the guard to be specified by the rule ID instead of an extension ID:

function-declarator := declarator [guard]

DenseMap<RuleID, RuleGuard> Guards;
clang-tools-extra/pseudo/lib/cxx/CXX.cpp
108

The "dfs" doesn't actually branch anywhere, so it's just a linear walk, and I think replacing the recursion with iteration is actually more readable here.

Also, the first known kind on the walk up == the last known kind on the walk down.

So I think this can be written as:

bool IsFunction = false;
// Walk down the declarator chain, innermost one wins.
// e.g. (*x)() is a non function, but *(x()) is a function.
for (;;) {
  if (Declarator->kind() != Sequence) // not well-formed, guess
    return IsFunction;

  switch (Declarator->rule()) {
    case noptr_declarator$$declarator_id: // reached the bottom
      return IsFunction;
     // *X is a nonfunction (unless X is a function).
    case ptr_declarator$$ptr_operator$ptr_declarator:
      Declarator = Declarator->elements()[1];
      IsFunction = false;
      continue;
     // X() is a function (unless X is a pointer or similar)
    case declarator$$noptr_declarator$parameters_and_qualifiers$...:
      Declarator = Declarator->elements()[0];
      IsFunction = true;
      continue;
    // (X) is whatever X is.
    case declarator$$l_paren$ptr_declarator$r_paren:
      Declarator = Declarator->elements()[1];
      continue;
    // ... more cases ...
    default:
      assert(false && "unhandled declarator for IsFunction");
      return IsFunction; 
  }
}
110

yes, we should. We can't descend, so I think we should just treat it as if it were an identifier.

158

if you haven't handled all rule kinds, you're just falling off the end here?
We should have an assertion in debug mode and probably treat it as opaque in production I think

sammccall added inline comments.Jul 7 2022, 10:30 AM
clang-tools-extra/pseudo/gen/Main.cpp
114 ↗(On Diff #442666)

so the dollar signs are a practical problem: at least on my machine the warning is on by default, so we emit thousands of warnings and it's impossible to find the error among them.

At *minimum* we need to use pragmas or something to suppress the warning.

But using nonstandard C++ seems likely to cause other problems and limit portability for not-great reasons.

Some ideas:
-Rule::ptr_declarator__EQUALS__ptr_operator__ptr_declarator (this violates the internal __ rule, though that one is widely ignored)

  • Rule::ptr_declarator_0ptr_operator_1ptr_declarator
  • Rule::PtrDeclarator_EQ_PtrOperator_PtrDeclarator (yet another spelling variation, but quite readable. We could even change the spellings in the BNF file...)
clang-tools-extra/pseudo/include/clang-pseudo/Language.h
29 ↗(On Diff #442666)

Having played with this idea a bit, I do think we should strongly consider dispatching on the rule ID rather than a named extension. I went to add some more guards (on NUMERIC_CONSTANT etc) and they really are 1:1 with rules.

hokein added inline comments.Jul 7 2022, 1:25 PM
clang-tools-extra/pseudo/gen/Main.cpp
114 ↗(On Diff #442666)

so the dollar signs are a practical problem: at least on my machine the warning is on by default, so we emit thousands of warnings and it's impossible to find the error among them.

I didn't see this warning on my machine, I'm using the g++.

Rule::ptr_declaratorEQUALSptr_operatorptr_declarator (this violates the internal rule, though that one is widely ignored)

It is not quite readable to me.

Rule::PtrDeclarator_EQ_PtrOperator_PtrDeclarator (yet another spelling variation, but quite readable. We could even change the spellings in the BNF file...)

It is better, but I'm not willing to change the spelling to CamelCase in bnf grammar, I like the current ptr-declarator name, and it matches the standard style. We could keep them unchanged in the bnf, and do the conversion during the mangling.

Rule::ptr_declarator_0ptr_operator_1ptr_declarator

I'm starting to like it now. I find that encoding the index for RHS is useful, especially when we do some work on the ForestNode based on the rules. So I will vote this one.

clang-tools-extra/pseudo/include/clang-pseudo/Language.h
29 ↗(On Diff #442666)

This allows/encourages guards that dynamically consider multiple rules, which increases the amount of coupling between rules and grammar.

Yeah, this is the sad bit of this approach, but IMO it is not that bad.

Can we express this as
function-declarator := declarator [guard=FunctionDeclarator]?
This does mangle the grammar a bit to our implementation, and introduces two names for the same concept (guard & nonterminal).

We can do this (we will also need another non-function-declarator nonterminal).

I feel uncomfortable of it, this seems a non-trivial&non-local change (we're changing an important nonterminal) -- Declarator is the only top-level declarator defined in the standard grammar, and now we add two more, and we need to update all related declaration rules. We're making it more complicated, which might introduce subtle grammar bugs.

There is a 3rd approach -- we could consider make the guard on symbol level rather than on Rule level. The guard corresponds to one of the grammar symbols on RHS.

29 ↗(On Diff #442666)

Having played with this idea a bit, I do think we should strongly consider dispatching on the rule ID rather than a named extension. I went to add some more guards (on NUMERIC_CONSTANT etc) and they really are 1:1 with rules.

This is an interesting idea. The annotation syntax [guard] in grammar is not required, as we can just use {Rule::function-declarator$$declarator, guardFunctionDeclarator} to build the map, this means we loose the connection between the grammar annotation and C++ binding code, which seems like a regression to me :(

It has a hard restriction: every grammar rule will have at most 1 guard. It is probably fine, I can't think of a case where we will use multiple guards for a single rule (even we have, we could solve it by introducing a new rule with guard).

sammccall added inline comments.Jul 7 2022, 3:59 PM
clang-tools-extra/pseudo/gen/Main.cpp
114 ↗(On Diff #442666)

I didn't see this warning on my machine, I'm using the g++.

Clang seems to enable -Wdollar-in-identifier-extension by default.

It is better, but I'm not willing to change the spelling to CamelCase in bnf grammar

Fair enough. I find this option the most readable, but the lack of consistency is annoying.

I'm starting to like it now. I find that encoding the index for RHS is useful, especially when we do some work on the ForestNode based on the rules. So I will vote this one.

Ok, i find this least readable, but good in all other respects. Let's go for it.

clang-tools-extra/pseudo/include/clang-pseudo/Language.h
29 ↗(On Diff #442666)

This is an interesting idea. The annotation syntax [guard] in grammar is not required ... This means we loose the connection between the grammar annotation and C++ binding code,

I don't really understand this argument: my suggestion was to keep the [guard] annotations to preserve this connection.
If you think that's unnecessary then we can drop it (i don't feel strongly). But it can't be both necessary and unnecessary!

29 ↗(On Diff #442666)

Having played with this idea a bit, I do think we should strongly consider dispatching on the rule ID rather than a named extension. I went to add some more guards (on NUMERIC_CONSTANT etc) and they really are 1:1 with rules.

This is an interesting idea. The annotation syntax [guard] in grammar is not required, as we can just use {Rule::function-declarator$$declarator, guardFunctionDeclarator} to build the map, this means we loose the connection between the grammar annotation and C++ binding code, which seems like a regression to me :(

It has a hard restriction: every grammar rule will have at most 1 guard. It is probably fine, I can't think of a case where we will use multiple guards for a single rule (even we have, we could solve it by introducing a new rule with guard).

It's always possible, if the rule imposes two conditions then the rule's guard can check both.

29 ↗(On Diff #442666)

Can we express this as
function-declarator := declarator [guard=FunctionDeclarator]?
This does mangle the grammar a bit to our implementation, and introduces two names for the same concept (guard & nonterminal).

We can do this (we will also need another non-function-declarator nonterminal).

I feel uncomfortable of it, this seems a non-trivial&non-local change (we're changing an important nonterminal)

To me this seems like an advantage:
This non-function-declarator is clearly an important concept being processed at grammatical parse time, it should have a name and be visible in the output!

Declarator is the only top-level declarator defined in the standard grammar, and now we add two more, and we need to update all related declaration rules. We're making it more complicated, which might introduce subtle grammar bugs.

This logic needs to live somewhere. We're using a grammar not just because the standard provides one, but because it's easier to reason about and check than imperative code.
Grammar bugs are possible, but they're not made any better by moving them out of the grammar!

There is a 3rd approach -- we could consider make the guard on symbol level rather than on Rule level. The guard corresponds to one of the grammar symbols on RHS.

Yeah. This looks more awkward to implement than rule guards, but i think the biggest difference is (not) representing "this declarator is a function declarator" in the forest.

hokein added inline comments.Jul 8 2022, 3:26 AM
clang-tools-extra/pseudo/gen/Main.cpp
114 ↗(On Diff #442666)

Sounds good, separated it in https://reviews.llvm.org/D129359.

hokein updated this revision to Diff 444976.Jul 15 2022, 7:17 AM

rebase and address the main comment -- encoding the function-declarator into the grammar.

hokein updated this revision to Diff 445427.Jul 18 2022, 2:19 AM
hokein marked 3 inline comments as done.

rebase and add comments.

hokein edited the summary of this revision. (Show Details)Jul 18 2022, 2:19 AM

All comments should be addressed now. As discussed offline, I'm going to land it now, happy to address any post comments.

sammccall accepted this revision.Jul 18 2022, 1:22 PM

impl looks good!

clang-tools-extra/pseudo/test/cxx/declarator-function.cpp
6–28

I'm not convinced that rewriting this test (rather than just removing the FIXME and XFAIL) is an improvement.

We're asserting a lot more stuff, but none of it is actually important. I find it tends to make it hard to find the signal among the noise of failures.

7

FWIW: I think that including the full contents of each line is bit unneccesary and brittle.

e.g. you can leave off the := ... without any risk of matching spuriously

clang-tools-extra/pseudo/test/cxx/declarator-var.cpp
6

this is testing essentially the same thing as the new int s2 case you added (though this one shows the subtle difference better IMO).

I think you should keep either one, but not both.

This revision is now accepted and ready to land.Jul 18 2022, 1:22 PM
hokein updated this revision to Diff 445714.Jul 19 2022, 12:48 AM
hokein marked 3 inline comments as done.

restore the previous and simplified function-declarator test.

This revision was landed with ongoing or failed builds.Jul 19 2022, 12:49 AM
This revision was automatically updated to reflect the committed changes.

It seems this change broke build on windows for me:

clang-tools-extra\pseudo\lib\cxx\CXX.cpp(59): error C2838: 'noptr_declarator_0declarator_id': illegal qualified name in member declaration
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(59): error C2065: 'noptr_declarator_0declarator_id': undeclared identifier
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(59): error C2131: expression did not evaluate to a constant
  clang-tools-extra\pseudo\lib\cxx\CXX.cpp(59): note: a non-constant (sub-)expression was encountered
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(62): error C2838: 'ptr_declarator_0ptr_operator_1ptr_declarator': illegal qualified name in member declaration
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(62): error C2065: 'ptr_declarator_0ptr_operator_1ptr_declarator': undeclared identifier
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(62): error C2131: expression did not evaluate to a constant
  clang-tools-extra\pseudo\lib\cxx\CXX.cpp(62): note: a non-constant (sub-)expression was encountered
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(68): error C2838: 'declarator_0noptr_declarator_1parameters_and_qualifiers_2trailing_return_type': illegal qualified name in member declaration
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(68): error C2065: 'declarator_0noptr_declarator_1parameters_and_qualifiers_2trailing_return_type': undeclared identifier
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(68): error C2131: expression did not evaluate to a constant
  clang-tools-extra\pseudo\lib\cxx\CXX.cpp(68): note: a non-constant (sub-)expression was encountered
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(69): error C2838: 'noptr_declarator_0noptr_declarator_1parameters_and_qualifiers': illegal qualified name in member declaration
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(69): error C2065: 'noptr_declarator_0noptr_declarator_1parameters_and_qualifiers': undeclared identifier
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(69): error C2131: expression did not evaluate to a constant
  clang-tools-extra\pseudo\lib\cxx\CXX.cpp(69): note: a non-constant (sub-)expression was encountered
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(75): error C2838: 'noptr_declarator_0noptr_declarator_1l_square_2constant_expression_3r_square': illegal qualified name in member declaration
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(75): error C2065: 'noptr_declarator_0noptr_declarator_1l_square_2constant_expression_3r_square': undeclared identifier
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(75): error C2131: expression did not evaluate to a constant
  clang-tools-extra\pseudo\lib\cxx\CXX.cpp(75): note: a non-constant (sub-)expression was encountered
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(76): error C2838: 'noptr_declarator_0noptr_declarator_1l_square_2r_square': illegal qualified name in member declaration
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(76): error C2065: 'noptr_declarator_0noptr_declarator_1l_square_2r_square': undeclared identifier
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(76): error C2131: expression did not evaluate to a constant
  clang-tools-extra\pseudo\lib\cxx\CXX.cpp(76): note: a non-constant (sub-)expression was encountered
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(81): error C2838: 'noptr_declarator_0l_paren_1ptr_declarator_2r_paren': illegal qualified name in member declaration
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(81): error C2065: 'noptr_declarator_0l_paren_1ptr_declarator_2r_paren': undeclared identifier
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(81): error C2131: expression did not evaluate to a constant
  clang-tools-extra\pseudo\lib\cxx\CXX.cpp(81): note: a non-constant (sub-)expression was encountered
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(84): error C2838: 'ptr_declarator_0noptr_declarator': illegal qualified name in member declaration
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(84): error C2065: 'ptr_declarator_0noptr_declarator': undeclared identifier
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(84): error C2131: expression did not evaluate to a constant
  clang-tools-extra\pseudo\lib\cxx\CXX.cpp(84): note: a non-constant (sub-)expression was encountered
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(85): error C2838: 'declarator_0ptr_declarator': illegal qualified name in member declaration
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(85): error C2065: 'declarator_0ptr_declarator': undeclared identifier
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(85): error C2131: expression did not evaluate to a constant
  clang-tools-extra\pseudo\lib\cxx\CXX.cpp(85): note: a non-constant (sub-)expression was encountered
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(59): error C2051: case expression not constant
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(62): error C2051: case expression not constant
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(67): error C2051: case expression not constant
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(69): error C2051: case expression not constant
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(74): error C2051: case expression not constant
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(76): error C2051: case expression not constant
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(81): error C2051: case expression not constant
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(84): error C2051: case expression not constant
clang-tools-extra\pseudo\lib\cxx\CXX.cpp(85): error C2051: case expression not constant

I confirm reverting just this patch fixes the build for me.

I confirm reverting just this patch fixes the build for me.

Hi @mizvekov,
Can you provide details about the compiler version?
Neither the presubmit nor postsubmit bots (e.g.) had this error, and I can't relate the error messages to the code (member declaration on line 59?!)
I fear this is a compiler bug we'll need to work around.

@hokein it probably still makes sense to revert...

Microsoft (R) C/C++ Optimizing Compiler Version 19.33.31627.1 for x64

As included in Microsoft Visual Studio 2022 Version 17.3.0 Preview 4.0

I un-reverted this locally to take a better look.

It does not seem to me to be a compiler bug, because both cl.exe and intellisense are picking this error up, and I believe the MSVC intellisense is based on a completely different frontend (EDG).

What I see is that enum class Rule is empty.

We try to populate it with

enum class Rule : RuleID {
#define RULE(X, Y) X = Y,
#include "CXXSymbols.inc"
#undef RULE
};

But I open CXXSymbols.inc and I see nothing that would expand that RULE macro, the contents of the file don't mention it at all, just NONTERMINAL and EXTENSION.

Running clang-pseudo-gen --grammar ..\clang-tools-extra\pseudo\lib\cxx\cxx.bnf --emit-symbol-list does seem to generate something that would work. But even if I delete CXXSymbols.inc, the cmake rule seems to be regenerating a wrong CXXSymbols.inc again, so I suspect this is a build system bug and we have wrong targets or dependecies somewhere.

The clang-pseudo-gen that sits in .\llvm\RelWithDebInfo\bin\clang-pseudo-gen generates correct output, the one that sits in .\llvm\NATIVE\Release\bin\clang-pseudo-gen.exe doesn't, which seems to be the one cmake is picking up for this job.

This could have something to do with "LLVM_OPTIMIZED_TABLEGEN": "ON", perhaps, and there is some dependency missing to rebuild the NATIVE target.

Yeah, turning off LLVM_OPTIMIZED_TABLEGEN made cmake instantly generate the correct CXXSymbols.inc. So while I think there is a bug out there somewhere, it might not necessarily be in this patch.

Thanks for digging!
Agreed then that reverting this patch won't really solve anything. I expect I can reproduce on linux by enabling LLVM_OPTIMIZED_TABLEGEN in the same way, looking into it...

I'm sorry for the trouble it causes, and thanks for all details!

This could have something to do with "LLVM_OPTIMIZED_TABLEGEN": "ON", perhaps, and there is some dependency missing to rebuild the NATIVE target.

Yeah, the native clang-pseudo-gen tool didn't rebuild somehow even its source file changes, it is a bug in our cmake config. Should be fixed in 2955192df8ac270515b5fa4aaa9e9380148e7f00 (I verified it locally). Can you retry it with LLVM_OPTIMIZED_TABLEGEN on?

Yeah, the native clang-pseudo-gen tool didn't rebuild somehow even its source file changes, it is a bug in our cmake config. Should be fixed in 2955192df8ac270515b5fa4aaa9e9380148e7f00 (I verified it locally). Can you retry it with LLVM_OPTIMIZED_TABLEGEN on?

Yeah I confirm that fixes it, thanks!

Though it's strange that you would have to declare that dependency explicitly, I think it should be implied.

One other small thing I noticed: It seems the pseudo-gen tool generated output takes care of undefining the macros it takes as input to expand, but we don't take advantage of that fact on clang-tools-extra\pseudo\include\clang-pseudo\cxx\CXX.h and are always undefing them. Just a minor thing :)

Yeah, the native clang-pseudo-gen tool didn't rebuild somehow even its source file changes, it is a bug in our cmake config. Should be fixed in 2955192df8ac270515b5fa4aaa9e9380148e7f00 (I verified it locally). Can you retry it with LLVM_OPTIMIZED_TABLEGEN on?

Yeah I confirm that fixes it, thanks!

Though it's strange that you would have to declare that dependency explicitly, I think it should be implied.

I think what's going on here: outer cmake invokes this opaque external command (which just _happens_ to be cmake --build) to produce an exe. The outer cmake has no special knowledge of what the inner cmake is going to use as inputs, so i it never invalidates the exe and invokes inner cmake again. Having the native exe depend on the target exe is a hack: it means that it (indirectly) depends on the right set of sources and is invalidated when they change. (In a perfect world we wouldn't build the target clang-pseudo-gen at all)

I think what's going on here: outer cmake invokes this opaque external command (which just _happens_ to be cmake --build) to produce an exe. The outer cmake has no special knowledge of what the inner cmake is going to use as inputs, so i it never invalidates the exe and invokes inner cmake again. Having the native exe depend on the target exe is a hack: it means that it (indirectly) depends on the right set of sources and is invalidated when they change. (In a perfect world we wouldn't build the target clang-pseudo-gen at all)

Yeah I agree, I just meant that build_native_tool (From https://reviews.llvm.org/rG2955192df8ac270515b5fa4aaa9e9380148e7f00) might as well just always add this dependency itself. I think that anything that uses it and forgets to add this dependency would be a bug anyway.