This is an archive of the discontinued LLVM Phabricator instance.

[pseudo] Reorganize CXX.h enums
ClosedPublic

Authored by sammccall on Jul 23 2022, 12:55 AM.

Details

Summary
  • Place rules under rule::lhs::rhsrhsrhs
  • Change mangling of keywords to ALL_CAPS (needed to turn keywords that appear alone on RHS into valid identifiers)
  • Make enums implicitly convertible to underlying type (though still scoped, using alias tricks)

In principle this lets us exhaustively write a switch over all rules of a NT:

switch ((rule::declarator)N->rule()) {
  case rule::declarator::noptr_declarator:
    ...
}

In practice we don't do this anywhere yet as we're often switching over multiple
nonterminal kinds at once.

Diff Detail

Event Timeline

sammccall created this revision.Jul 23 2022, 12:55 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 23 2022, 12:55 AM
sammccall requested review of this revision.Jul 23 2022, 12:55 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 23 2022, 12:55 AM

Thanks for exploring the idea, this looks like a good start to me.

clang-tools-extra/pseudo/include/clang-pseudo/cxx/CXX.h
54

why there is a dummy namespace here?

clang-tools-extra/pseudo/lib/cxx/CXX.cpp
121–122

The code of applying the pattern doesn't look much worse to me and it is easier to verify the exhaustiveness by human as well. We might loose some performance (I hope not a lot), but I think it is a good tradeoff (not saying we need do it in this patch).

switch (LHSNonterminal(Declarator->rule(), *G)) {
   case cxx::Symbol::noptr_declarator: {
      switch ((rule::noptr_declarator)Declarator->rule())  {
         case rule::noptr_declarator::declarator_id:
             ....
         case rule::noptr_declarator::noptr_declarator__parameters_and_qualifiers:
             ....
      }
   ...
   }
}
276

A nice improvement

clang-tools-extra/pseudo/lib/grammar/Grammar.cpp
47–48

IMO, this improves the readability, and also aligns with the text in the cxx.grammar.

47–48

nit: we don't allow nullable nonterminals, so the RHS should never be empty. we could just initialize MangleName with mangleSymbol(R.seq()[0]), and iterate from 1 (then the trailing two pop_back is not needed).

47–48

nit: update the doc comment in .h file.

48

ok, now we're dropping the index for all RHS symbols. Just want to know your thought about this. Personally, I'd like to keep this information (noptr_declarator__l_square__constant_expression__r_square vs noptr_declarator0_l_square1_constant_expression2_r_square3) though it makes the name a bit uglier.

Change mangling of keywords to ALL_CAPS (needed to turn keywords that appear alone on RHS into valid identifiers)

if we add index number to the name, then this change is not required.

sammccall added inline comments.Jul 25 2022, 1:36 AM
clang-tools-extra/pseudo/include/clang-pseudo/cxx/CXX.h
54

for each NT, we close the previous ns+enum and open new ones.
For this to work for the first NT, we have to open an ns+enum.

I can add a comment, but would prefer to do this some other way?

clang-tools-extra/pseudo/lib/cxx/CXX.cpp
121–122

I guess this is a question of taste, but I find that style very hard to read.

(Note that it's incorrectly indented, and the indentation rules around switch/case are one of the main reasons I find nested switches confusing)

clang-tools-extra/pseudo/lib/grammar/Grammar.cpp
47–48

Thanks. I like this change too. We still have semi vs ; (should we use SEMI?) but it feels clearer

47–48

Honestly I would rather just move the impl back into gen - this change seems to demonstrate why

48

Short version: I can deal with the numbers at the front (there's a *little* bit of value), but at the back of the symbol name there's no value, just noise. I find the double-underscore version much more readable (than either variant with numbers).


I always found the indexes aesthetically ugly but OK when you could read them as

lhs _   0 rhsa_ 1 rhsb
lhs := [0]rhsa [1]rhsb

But an identifier can't start with a number (rule::empty_statement::0semi) this isn't possible. I think you suggested shifting the numbers to the end, but I can't find a way to read that, it adds too much visual noise. (e.g. because the number "decorations" don't line up in a column in code completion or a switch statement).

I also think the double-underscore version is significantly less cryptic for new readers.

usaxena95 added inline comments.Jul 25 2022, 4:55 AM
clang-tools-extra/pseudo/include/clang-pseudo/cxx/CXX.h
54–62

I would include this block in the clang-format off block to show these are for the generated code.

//clang-format off
namespace dummy { enum Rule {
...
};}
//clang-format on
sammccall updated this revision to Diff 447286.Jul 25 2022, 5:23 AM

tweak formatting

hokein added inline comments.Jul 26 2022, 5:02 AM
clang-tools-extra/pseudo/include/clang-pseudo/cxx/CXX.h
54

ah, I get it (until I read the preprocessed CXX.h code) -- I found it really hard to understand it by literally reading the code here.

To make it clear, I think we can probably add two additional RULE_BEGIN, RULE_END macros?

in CXXSymbols.inc, we emit something like

RULE_BEGIN(_)
RULE(_, translation_unit, 135)
RULE(_, statement_seq, 136)
RULE(_, declaration_seq, 137))
RULE_END(_)

In CXX.h, we write

#define RULE_BEGIN(LHS) namespace LHS { enum Rule : RULE ID {
#define RULE_END()  }; }
clang-tools-extra/pseudo/lib/cxx/CXX.cpp
121–122

(Note that it's incorrectly indented, and the indentation rules around switch/case are one of the main reasons I find nested switches confusing)

Ah the indentation is weird (it is not what I originally written..). There are ways to make the indentation correct (turn off the clang-format etc).

If nested switch is hard to read, maybe we can wrap one with a macro to improve the code readability (happy to explore ideas, I think there is value on this pattern).

clang-tools-extra/pseudo/lib/grammar/Grammar.cpp
47–48

yeah, that looks good (that means lowercase is always referring to nonterminals)

47–48

ok, feel free to move it.

48

I find the double-underscore version much more readable (than either variant with numbers)

I agree that the double-underscore is more readable, but I don't have a much-better feeling. I'm ok with the double-underscore one if you feel strong about it.

The reason why I value the index is that it is easier to spot the index and its corresponding RHS element, and it could potentially avoid writing bugs (specially in the guard function implementation,
thinking about accessing the RHS element in the looong rule l_paren__r_paren__decl_specifier_seq__noexcept_specifier__trailing_return_type__requires_clause).

sammccall marked 6 inline comments as done.Jul 26 2022, 7:04 AM
sammccall added inline comments.
clang-tools-extra/pseudo/include/clang-pseudo/cxx/CXX.h
54

That would make the callsite more clear, but at the cost of adding ad-hoc bits to the CXXSymbols.inc.
I'd prefer to keep it generic if we can (otherwise we might as well just have Gen generate the enum definitions directly, right?)

Added a comment to explain the dummy.

54–62

Oops, they're not for the generated code, just for the macro definition (clang-format gets very confused)

Restricted the scope of the formatting-disabled block and tried to improve the hand-formatting (though I don't think any formatting of this particular preprocessor trick is very readable)

clang-tools-extra/pseudo/lib/cxx/CXX.cpp
121–122

Maybe - I think nonstandard formatting and macros cost a lot of readability in their own right.

I think we should probably discuss this separately - I think this change is worth having even if we don't change control flow to exploit in in all cases, WDYT?

clang-tools-extra/pseudo/lib/grammar/Grammar.cpp
47–48

I go back and forth: I think the use-mention distinction is important to readability too:

  • initializer_list means an initializer list goes here
  • INT does *not* mean an integer goes here, but rather the *word* "int"
  • by this test, comma should be lowercase, because it's a comma, not the word "comma" that goes here

(Now I'm actually wondering if we should go back and lowercase IDENTIFIER, NUMERIC_CONSTANT etc in the bnf file...)

In any case, let's uppercase SEMI for now because anything else leaves ugly inconsistency somewhere.

48

The reason why I value the index is that it is easier to spot the index and its corresponding RHS element

Right, I agree, but at least for me this benefit vanishes entirely if the index has to come at the end of the name instead of the start.

I don't feel strongly about double-underscore over 0foo_1bar (which isn't a valid identifier), but I do feel strongly about it over foo0_bar1.

Losing the indices is a tradeoff, but I don't feel like it's terrible (it's bad in the example you gave, but it's not the common case: median RHS tokens is 1, the average is 2, and 85% have <=3)

sammccall updated this revision to Diff 447675.Jul 26 2022, 7:05 AM
sammccall marked an inline comment as done.

address comments

hokein accepted this revision.Jul 26 2022, 10:58 AM

I think this patch is in a good state.

clang-tools-extra/pseudo/include/clang-pseudo/cxx/CXX.h
54

That would make the callsite more clear, but at the cost of adding ad-hoc bits to the CXXSymbols.inc.

I think this is the point. We read this cxx.h more often than the CXXSymbols.inc, this seems to be a right tradeoff to me.

I'd prefer to keep it generic if we can (otherwise we might as well just have Gen generate the enum definitions directly, right?)

And yeah, generating the whole definitions from the Gen tool is another option.

OK, if you prefer to do it in this way, could you at least rename the enum Rule { to something like enum PlaceHolder {? It is very confusing to use the meaningful name Rule for this dummy enum.

clang-tools-extra/pseudo/lib/cxx/CXX.cpp
121–122

yeah, it should not block this change.

clang-tools-extra/pseudo/unittests/GrammarTest.cpp
112

nit: I think the test is not valid now because we don't have mangleRule in the Grammar.

This revision is now accepted and ready to land.Jul 26 2022, 10:58 AM
This revision was landed with ongoing or failed builds.Jul 27 2022, 12:04 AM
This revision was automatically updated to reflect the committed changes.
sammccall marked 2 inline comments as done.
dyung added a subscriber: dyung.Jul 27 2022, 12:57 AM

Hi @sammccall, your change seems to be causing a build failure on the PS4 linux bot, can you take a look?

https://lab.llvm.org/buildbot/#/builders/139/builds/25678

FAILED: tools/clang/tools/extra/pseudo/gen/CMakeFiles/clang-pseudo-gen.dir/Main.cpp.o 
CCACHE_CPP2=yes CCACHE_HASHDIR=yes /usr/bin/ccache /usr/bin/g++ -DGTEST_HAS_RTTI=0 -D_DEBUG -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/tools/clang/tools/extra/pseudo/gen -I/home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/llvm-project/clang-tools-extra/pseudo/gen -I/home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/llvm-project/clang/include -I/home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/tools/clang/include -I/home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/include -I/home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/llvm-project/llvm/include -I/home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/llvm-project/clang-tools-extra/pseudo/include -I/home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/tools/clang/tools/extra/pseudo/include -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wno-missing-field-initializers -pedantic -Wno-long-long -Wimplicit-fallthrough -Wno-maybe-uninitialized -Wno-class-memaccess -Wno-redundant-move -Wno-pessimizing-move -Wno-noexcept-type -Wdelete-non-virtual-dtor -Wsuggest-override -Wno-comment -Wmisleading-indentation -fdiagnostics-color -ffunction-sections -fdata-sections -fno-common -Woverloaded-virtual -fno-strict-aliasing -O3 -DNDEBUG  -fno-exceptions -fno-rtti -UNDEBUG -std=c++14 -MD -MT tools/clang/tools/extra/pseudo/gen/CMakeFiles/clang-pseudo-gen.dir/Main.cpp.o -MF tools/clang/tools/extra/pseudo/gen/CMakeFiles/clang-pseudo-gen.dir/Main.cpp.o.d -o tools/clang/tools/extra/pseudo/gen/CMakeFiles/clang-pseudo-gen.dir/Main.cpp.o -c /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/llvm-project/clang-tools-extra/pseudo/gen/Main.cpp
/home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/llvm-project/clang-tools-extra/pseudo/gen/Main.cpp: In function ‘std::string clang::pseudo::{anonymous}::mangleSymbol(clang::pseudo::SymbolID, const clang::pseudo::Grammar&)’:
/home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/llvm-project/clang-tools-extra/pseudo/gen/Main.cpp:73:50: error: expected primary-expression before ‘]’ token
   73 |   static std::string *TokNames = new std::string[]{
      |                                                  ^
/home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/llvm-project/clang-tools-extra/pseudo/gen/Main.cpp:77:7: error: too many initializers for ‘std::string [1]’ {aka ‘std::__cxx11::basic_string<char> [1]’}
   77 |       };
      |       ^

Sorry Doug!
7b70c2e75cd6b21cf9c2ea2f2ce5f7b9c4202549 should fix, not sure why my local
clang accepts it.