This is an archive of the discontinued LLVM Phabricator instance.

[pseudo] experiment
Needs ReviewPublic

Authored by hokein on Jul 27 2022, 4:56 AM.
This revision needs review, but there are no reviewers specified.

Details

Reviewers
None
Summary

This is not the direction we will persume, but it is an experiment to
see how many ambiguities left if we have the perfect type information
for all identifiers in the parsing file.

Just post the results

Test file clangd/ASTSignals.cpp (no ambiguity!)

Results:
Before: https://htmlpreview.github.io/?https://gist.githubusercontent.com/hokein/6910758199abc4ede5fb2c5a5553b00f/raw/0665ce71ccf6767121cd3cb49ee8bb597a8fd2f3/ASTSignalsBefore.html
after: https://htmlpreview.github.io/?https://gist.githubusercontent.com/hokein/3757357ac5787a0fe64cd1601f5c4d8f/raw/b6024fe27f7c39bcbb8d2e6c967d9561717a5af6/ASTSignals.html

Diff Detail

Event Timeline

hokein created this revision.Jul 27 2022, 4:56 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 27 2022, 4:56 AM
hokein requested review of this revision.Jul 27 2022, 4:56 AM
hokein edited the summary of this revision. (Show Details)Jul 27 2022, 5:38 AM

The solution in imperfect here (it is based on the identifier content, so it won't work if the code happen to have two same-content-but-different-kind tokens, e.g. trace::Span Span;), but it at least shows that ambiguities in identifiers are the most critical one.

And it gives some evidences that if we know all type information of identifiers, we could potentially get a perfect parse tree even without any soft disambiguation mechanism. I think It might affect our designs in some scenarios -- for example, if we have a fresh clang AST, we can first annotate all identifier tokens (what does this identifier refer to, a class, a enum etc), and use this approach to build a forest in the pseudoparser (we might not need a real disambiguation, because the forest likely ends up with a single perfect tree:)).

Had some offline discussion with @ilya-biryukov today:

If we look at all ambiguities in ASTSignals.cpp

45 Ambiguous nodes:
   18 type-name
   14 simple-type-specifier
    5 postfix-expression
    3 namespace-name
    3 nested-name-specifier
    1 relational-expression
    1 template-argument

Most of ambiguities (>90%) are just "local", they won't affect the structure of the tree, and they seem to be less useful. If we think about the final output clang syntax-tree, we care about tree structures, and these ambiguities basically provide zero value. 
For example,  the nested-name-specifier trace::span case, in the clang syntax-tree we model the trace specifier as a general identifier name specifier regardless whether the trace is a type-name or namespace-name;
the simple-declaration ASTSignals Signals; case, it is sufficient to know it is a simple-declaration, and ASTSignals is a simple-type-specifier, but whether the simple-type-specifier is type-name (thus class-name, enum-name, typedef-name) or template-name is less interesting, and we probably don't want to distinguish them in the clang syntax-tree;
similar to the postfix-expression Foo(...); case, we might use the same node in the clang syntax-tree to model a function-call and an explicit class type conversion.

So one option will be to eliminate these "local" ambiguities in the forest (by replacing the type-name, class-name, enum-name, typedef-name, template-name with a generic name), as it won't affect tree-structure. Re the implementation, we can do a post-process on the forest -- replace an ambiguous forest node if all its alternatives share the same tree structure, ad-hoc targeting on type-name, simple-type-specifier, postfix-expression nonterminals is probably enough. An alternative is to adjust the cxx grammar rules (not sure how intrusive the change is);

The only "real" ambiguity is the dyn_cast<NamespaceDecl>(ND->getDeclContext()) (whether it is a postfix expression, or a pair of comparison expressions). This is a real ambiguity in C++ that requires type information to resolve. For these ambiguities, we can't eliminate them, and we do need a ranking-based disambiguation.