This is an archive of the discontinued LLVM Phabricator instance.

[pseudo] Eliminate multiple-specified-types ambiguities using guards
ClosedPublic

Authored by sammccall on Jul 22 2022, 2:29 AM.

Details

Summary

Motivating case: foo bar; is not a declaration of nothing with foo and bar
both types.

This is a common and critical ambiguity, clangd/AST.cpp has 20% fewer
ambiguous nodes (1674->1332) after this change.

This could benefit from caching, but there's no caching infra in this patch.

Diff Detail

Event Timeline

sammccall created this revision.Jul 22 2022, 2:29 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 22 2022, 2:29 AM
sammccall requested review of this revision.Jul 22 2022, 2:29 AM
sammccall edited the summary of this revision. (Show Details)Jul 22 2022, 2:29 AM
sammccall added a reviewer: usaxena95.
sammccall added a comment.EditedJul 22 2022, 2:32 AM

Implementation choices here:

  • we walk through the whole tree everytime we see decl-specifier-seq: this is dumb but we to reuse work we either need to add extra rules to the grammar (D130150) or an explicit cache. Caching is likely a good idea but not added in this patch.
  • we handle *all* rules for the interesting node types explicitly, rather than default: return false. This allows us to assert that all cases are handled, so things don't "fall through the cracks" after grammar changes. Alternative suggestions welcome. (I have a feeling this "exhaustive pattern-matching" idea will come up again...)
  • mix of iteration and recursion in the implementation: I suspect this doesn't matter much, as we'll rework it when adding caching.
sammccall updated this revision to Diff 446753.Jul 22 2022, 2:34 AM

use all_of

FWIW, as-is with no caching, this is a ~2% slowdown on my machine (5.82 -> 5.72 MB/s on SemaCodeComplete.cpp).
Whereas D130150 using the grammar is a a 7% speedup (5.82 -> 6.22), so roughly an 9% performance difference between the approaches.
My guess is we'll get some but not all of this back through caching, as hashing isn't free and we'll increase the size of our working set.

FWIW, as-is with no caching, this is a ~2% slowdown on my machine (5.82 -> 5.72 MB/s on SemaCodeComplete.cpp).
Whereas D130150 using the grammar is a a 7% speedup (5.82 -> 6.22), so roughly an 9% performance difference between the approaches.
My guess is we'll get some but not all of this back through caching, as hashing isn't free and we'll increase the size of our working set.

And indeed, I see about 6.0MB/s with a simple llvm::DenseMap<ForestNode*, bool>, so I expect we can get ~half the performance back.

Thanks, the change looks good in general.

we handle *all* rules for the interesting node types explicitly, rather than default: return false. This allows us to assert that all cases are handled, so things don't "fall through the cracks" after grammar changes. Alternative suggestions welcome. (I have a feeling this "exhaustive pattern-matching" idea will come up again...)

The dangerous bit is that now we will crash at the runtime if the parsing code triggers a missing case.

Yeah, I hit this problem in the previous function-declarator patch. One approach will be to define an enum for each nonterminal enum Nonterminal { rhs0_rhs1 }, rather than put all rules into a single enum. It is easier to enumerate all values for a single nonterminal (I think this is the common case)

namespace cxx {
namespace rule {

enum simple_type_specifier {
  builtin_type0,
  nested_name_specifier0_type_name1,
  ...
}

}
}

FWIW, as-is with no caching, this is a ~2% slowdown on my machine (5.82 -> 5.72 MB/s on SemaCodeComplete.cpp).

Actually, the number looks quite good to me (I expected that there will be a significant slowdown without caching). I think we can check in the current version, and add the caching stuff afterwards.

Whereas D130150 using the grammar is a a 7% speedup (5.82 -> 6.22), so roughly an 9% performance difference between the approaches.

Yeah, because the grammar-based approach reduces the number of ambiguous node in the forest.

clang-tools-extra/pseudo/lib/cxx/CXX.cpp
284

nit: I would suggest using the index explicitly P.RHS[0], P.RHS[1], it increases the readability (the rul name encoding the index, easier to spot the corresponding element).

clang-tools-extra/pseudo/lib/cxx/cxx.bnf
353

offtopic comment: The sad bit of the RuleID approach (vs guard=SingleType) is that we don't really know what kind of the guard is by reading the grammar file only.

I think this is critical information, and worth to keep in the grammar file. (ideas: add comments, or bring the guard=SingleType in the grammar again, but we ignore the guard value in the grammar parsing).

385

I think the reason to leave SHORT/LONG/SIGNED UNSIGNED as-is is that they can combined with other type (e.g. short int).

Can we group them together, and add a comment?

sammccall marked 2 inline comments as done.Jul 22 2022, 6:07 AM
sammccall added inline comments.
clang-tools-extra/pseudo/lib/cxx/cxx.bnf
353

Yeah, the current balance doesn't feel obviously right.

I'd like to leave this for the time being, because of the various options (remove the annotations, replace them with comments, bring back values), i'm not sure there's a clear winner.

I have a suspicion that while it's appealing now to at least reference here all the restrictions that may apply, when we add "soft" disambiguation based on scoring it may not be so appealing as we won't be documenting the things that affect the parse in practice.

385

Grouped them.
I don't think the idea that SHORT is a specifier but not actually a type needs to be spelled out, but added a comment about builtin-type (which is nonstandard) which hints at this.

sammccall marked an inline comment as done.Jul 22 2022, 6:11 AM

Thanks, the change looks good in general.

we handle *all* rules for the interesting node types explicitly, rather than default: return false. This allows us to assert that all cases are handled, so things don't "fall through the cracks" after grammar changes. Alternative suggestions welcome. (I have a feeling this "exhaustive pattern-matching" idea will come up again...)

The dangerous bit is that now we will crash at the runtime if the parsing code triggers a missing case.

Yeah, grammar changes make this brittle.
Still, hopefully we canary our releases...

Yeah, I hit this problem in the previous function-declarator patch. One approach will be to define an enum for each nonterminal enum Nonterminal { rhs0_rhs1 }, rather than put all rules into a single enum. It is easier to enumerate all values for a single nonterminal (I think this is the common case)

Well, I don't think it's the most common (vs just targeting a rule or two) but certainly we never enumerate *all* the rules!
Interesting idea.

I'm not sure it'd be worth adding control-flow here to split up the switches by rule type though, switches are pretty heavyweight syntactically.

FWIW, as-is with no caching, this is a ~2% slowdown on my machine (5.82 -> 5.72 MB/s on SemaCodeComplete.cpp).

Actually, the number looks quite good to me (I expected that there will be a significant slowdown without caching). I think we can check in the current version, and add the caching stuff afterwards.

I'm not so happy...

The relevant baseline here IMO is the +7% version, as we're clearly currently paying ~8% cost to build all the silly incorrect parses, and that cost is artificial.
So 9% overall slowdown now, and still 5% with the cache, feels significant. But we can change this later.

Whereas D130150 using the grammar is a a 7% speedup (5.82 -> 6.22), so roughly an 9% performance difference between the approaches.

Yeah, because the grammar-based approach reduces the number of ambiguous node in the forest.

*Both* approaches do that, which shows how slow the guard version is :-(

LMK if anything else blocking here. I want to take a stab at changing the enums (cool idea), but I don't think there's much point blocking this patch on it.
Better to use it as a testbed for the change.

The change looks good to me.

LMK if anything else blocking here.

I don't want to block you, but I'd suggest postponing it a little bit until we collect some metrics in our internal pipeline (I think usaxena95@ is working on it, hopefully we will get it next week).

I want to take a stab at changing the enums (cool idea), but I don't think there's much point blocking this patch on it.

Agree.

Well, I don't think it's the most common (vs just targeting a rule or two) but certainly we never enumerate *all* the rules!
Interesting idea.

If we look at the existing guard implementations, we have a few of these usages:

  • in isFunctionDeclarator, we enumerate all rules of noptr_declarator, ptr_declarator, declarator ;
  • in hasExclusiveType, we enumerate all rules of decl_specifier, simple_type_specifier, type_specifier, type_specifier_seq etc;
hokein accepted this revision.Jul 22 2022, 6:33 AM
This revision is now accepted and ready to land.Jul 22 2022, 6:33 AM

If we look at the existing guard implementations, we have a few of these usages:

  • in isFunctionDeclarator, we enumerate all rules of noptr_declarator, ptr_declarator, declarator ;
  • in hasExclusiveType, we enumerate all rules of decl_specifier, simple_type_specifier, type_specifier, type_specifier_seq etc;

In both cases, we're enumerating all the rules of *multiple* nonterminals.
To get the compiler to verify exhaustiveness, we'd have to first switch over target symbol, then switch over rule ID. I expect this to be both a readability and performance hit, so I'm not sure we'll actually do it.
Still, rule::simple_declaration::decl_specifier_seq__declarator-seq__SEMI is a step up in readability and code-completion IMO. Sent https://reviews.llvm.org/D130414

Feel free to land it. We have some number now.

This revision was landed with ongoing or failed builds.Jul 25 2022, 3:57 AM
This revision was automatically updated to reflect the committed changes.