This is an archive of the discontinued LLVM Phabricator instance.

Differential D130626

[pseudo] experiment
Needs Review · Public

Authored by hokein on Jul 27 2022, 4:56 AM.

This revision needs review, but there are no reviewers specified.

Details

Reviewers: None

Summary

This is not the direction we will pursue, but it is an experiment to
see how many ambiguities are left if we have perfect type information
for all identifiers in the file being parsed.

Just posting the results.

Test file clangd/ASTSignals.cpp (no ambiguity!)

Results:
Before: https://htmlpreview.github.io/?https://gist.githubusercontent.com/hokein/6910758199abc4ede5fb2c5a5553b00f/raw/0665ce71ccf6767121cd3cb49ee8bb597a8fd2f3/ASTSignalsBefore.html
After: https://htmlpreview.github.io/?https://gist.githubusercontent.com/hokein/3757357ac5787a0fe64cd1601f5c4d8f/raw/b6024fe27f7c39bcbb8d2e6c967d9561717a5af6/ASTSignals.html

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

hokein created this revision. Jul 27 2022, 4:56 AM

Herald added a project: Restricted Project. Jul 27 2022, 4:56 AM

Herald added subscribers: usaxena95, kadircet.

hokein requested review of this revision. Jul 27 2022, 4:56 AM

Herald added a project: Restricted Project. Jul 27 2022, 4:56 AM

Herald added subscribers: cfe-commits, alextsao1999, ilya-biryukov.

Harbormaster completed remote builds in B177837: Diff 448004. Jul 27 2022, 4:56 AM

hokein edited the summary of this revision. Jul 27 2022, 5:38 AM

The solution is imperfect here (it is based on the identifier's spelling, so it won't work if the code happens to have two tokens with the same spelling but different kinds, e.g. trace::Span Span;), but it at least shows that ambiguities in identifiers are the most critical ones.
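To make the limitation concrete, here is a minimal sketch of a spelling-keyed lookup. It is not the patch's code: the Variable bit and the mayBeClassName helper are hypothetical (the patch's table has no variable kind), and std::map stands in for llvm::StringMap. Because the key is the spelling alone, both tokens in trace::Span Span; receive the same answer, whatever each one refers to at its position.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical kinds for illustration; only Class mirrors a bit in the patch.
enum IDKind : unsigned { Class = 1u << 0, Variable = 1u << 1 };

using SpellingTable = std::map<std::string, unsigned>;

// Answers "may this identifier be a class-name?" from its spelling only.
// Unknown spellings stay permissive, so their ambiguity is kept.
inline bool mayBeClassName(const SpellingTable &Table,
                           const std::string &Spelling) {
  assert(!Spelling.empty() && "identifier tokens have a non-empty spelling");
  auto It = Table.find(Spelling);
  if (It == Table.end())
    return true;
  return (It->second & Class) != 0;
}
```

With a table containing {"Span", Class}, the type token and the variable token of trace::Span Span; are both accepted as class names; a position-aware annotation would be needed to tell them apart.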

It also gives some evidence that if we know the type information of all identifiers, we could potentially get a perfect parse tree even without any soft disambiguation mechanism. I think it might affect our designs in some scenarios. For example, if we have a fresh clang AST, we can first annotate all identifier tokens (what each identifier refers to: a class, an enum, etc.), and use this approach to build a forest in the pseudoparser (we might not need real disambiguation, because the forest likely ends up as a single perfect tree :)).

Had some offline discussion with @ilya-biryukov today:

If we look at all ambiguities in ASTSignals.cpp,

45 Ambiguous nodes:
   18 type-name
   14 simple-type-specifier
    5 postfix-expression
    3 namespace-name
    3 nested-name-specifier
    1 relational-expression
    1 template-argument

Most of the ambiguities (>90%) are just "local": they don't affect the structure of the tree, and they seem to be less useful. If we think about the final output, the clang syntax-tree, we care about tree structure, and these ambiguities provide basically zero value.
For example, in the nested-name-specifier trace::Span case, the clang syntax-tree models the trace specifier as a general identifier name specifier, regardless of whether trace is a type-name or a namespace-name;
in the simple-declaration ASTSignals Signals; case, it is sufficient to know that it is a simple-declaration and that ASTSignals is a simple-type-specifier; whether the simple-type-specifier is a type-name (thus class-name, enum-name, typedef-name) or a template-name is less interesting, and we probably don't want to distinguish them in the clang syntax-tree;
similarly, in the postfix-expression Foo(...); case, we might use the same node in the clang syntax-tree to model both a function call and an explicit class type conversion.
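As an illustration of the Foo(...) case, the sketch below compiles the same expression under two declarations of Foo. The namespaces and the evaluate and checkBothReadings helpers are hypothetical names chosen for this example. The token sequence and the shape of the tree are identical in both; only the label on the node (function call versus explicit type conversion) differs.

```cpp
#include <cassert>

namespace foo_is_a_class {
struct Foo {
  int Value;
  explicit Foo(int V) : Value(V) {}
};
// Foo(7) is an explicit type conversion that constructs a temporary.
inline int evaluate() { return Foo(7).Value; }
} // namespace foo_is_a_class

namespace foo_is_a_function {
inline int Foo(int V) { return V + 1; }
// Foo(7) is an ordinary function call.
inline int evaluate() { return Foo(7); }
} // namespace foo_is_a_function

// Same spelling, same tree shape, different meaning.
inline void checkBothReadings() {
  assert(foo_is_a_class::evaluate() == 7);
  assert(foo_is_a_function::evaluate() == 8);
}
```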

So one option is to eliminate these "local" ambiguities in the forest (by replacing type-name, class-name, enum-name, typedef-name and template-name with a generic name), as this won't affect the tree structure. As for the implementation, we can post-process the forest: replace an ambiguous forest node if all its alternatives share the same tree structure. Ad-hoc targeting of the type-name, simple-type-specifier and postfix-expression nonterminals is probably enough. An alternative is to adjust the cxx grammar rules (not sure how intrusive that change would be).
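A rough sketch of the "same tree structure" test follows. It uses a hypothetical, simplified Node type rather than the clang-pseudo ForestNode API, and it assumes that "same structure" means equal bracketing of the terminals once single-child chains (such as type-name := class-name := IDENTIFIER) are collapsed. That definition is one possible reading of the idea above, not something the patch implements.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified forest node; not the clang-pseudo ForestNode API.
struct Node {
  enum Kind { Terminal, Sequence, Ambiguous } K;
  std::string Symbol;                 // e.g. "type-name", "IDENTIFIER"
  std::vector<const Node *> Children; // RHS (Sequence) or alternatives
};

// Bracketing of the terminals under N, ignoring nonterminal labels.
inline std::string shape(const Node &N) {
  if (N.K == Node::Terminal)
    return N.Symbol;
  assert(!N.Children.empty() && "non-terminal nodes have children");
  if (N.K == Node::Sequence) {
    if (N.Children.size() == 1)
      return shape(*N.Children.front()); // unary chain: drop the label
    std::string S = "(";
    for (const Node *C : N.Children)
      S += shape(*C) + " ";
    S.back() = ')';
    return S;
  }
  // Ambiguous: collapses to one shape when every alternative agrees.
  std::string First = shape(*N.Children.front()), Joined = "{";
  bool AllSame = true;
  for (const Node *C : N.Children) {
    std::string S = shape(*C);
    AllSame = AllSame && S == First;
    Joined += S + "|";
  }
  Joined.back() = '}';
  return AllSame ? First : Joined;
}

// True if N is ambiguous but all alternatives have the same shape.
inline bool isLocalAmbiguity(const Node &N) {
  if (N.K != Node::Ambiguous || N.Children.empty())
    return false;
  for (const Node *C : N.Children)
    if (shape(*C) != shape(*N.Children.front()))
      return false;
  return true;
}
```

Under this definition, a type-name node whose alternatives are class-name and enum-name over the same IDENTIFIER counts as local, while the two readings of a < b > (c) do not, because their bracketing differs.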

The only "real" ambiguity is dyn_cast<NamespaceDecl>(ND->getDeclContext()): whether it is a postfix expression or a pair of comparison expressions. This is a real ambiguity in C++ that requires type information to resolve. We can't eliminate ambiguities like this one, and we do need ranking-based disambiguation for them.
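The two readings can be reproduced in plain C++. In the sketch below the namespaces and the parse and checkReadings helpers are hypothetical, and the second namespace deliberately redeclares dyn_cast and NamespaceDecl as integers to force the comparison reading. The expression text is identical in both functions; only the declarations in scope decide how it parses. A compiler may warn about the chained comparison.

```cpp
#include <cassert>

namespace as_template {
struct NamespaceDecl {};
template <typename T> int dyn_cast(int X) { return X + 100; }
// dyn_cast names a template: this is a call, i.e. a postfix-expression.
inline int parse(int X) { return dyn_cast<NamespaceDecl>(X); }
} // namespace as_template

namespace as_values {
constexpr int dyn_cast = 1;
constexpr int NamespaceDecl = 2;
// dyn_cast names a variable: this is (dyn_cast < NamespaceDecl) > (X).
inline bool parse(int X) { return dyn_cast<NamespaceDecl>(X); }
} // namespace as_values

inline void checkReadings() {
  assert(as_template::parse(0) == 100); // call reading
  assert(as_values::parse(0));          // (1 < 2) > 0, i.e. 1 > 0
  assert(!as_values::parse(1));         // (1 < 2) > 1, i.e. 1 > 1
}
```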

hokein added a subscriber: sammccall. Jul 28 2022, 1:32 AM

hokein mentioned this in D130747: [pseudo] Eliminate the type-name identifier ambiguities in the grammar. Jul 29 2022, 12:39 AM

hokein mentioned this in rG6a9f79e1020d: [pseudo] Eliminate the type-name identifier ambiguities in the grammar. Aug 17 2022, 5:31 AM

Revision Contents

Path                                        Size
clang-tools-extra/pseudo/lib/cxx/CXX.cpp    77 lines
clang-tools-extra/pseudo/lib/cxx/cxx.bnf    20 lines

Diff 448004

clang-tools-extra/pseudo/lib/cxx/CXX.cpp

 //===--- CXX.cpp - Define public interfaces for C++ grammar ---------------===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
 // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 //
 //===----------------------------------------------------------------------===//

 #include "clang-pseudo/cxx/CXX.h"
 #include "clang-pseudo/Forest.h"
 #include "clang-pseudo/Language.h"
 #include "clang-pseudo/grammar/Grammar.h"
 #include "clang-pseudo/grammar/LRTable.h"
 #include "clang/Basic/CharInfo.h"
 #include "clang/Basic/TokenKinds.h"
+#include "llvm/ADT/StringMap.h"
 #include "llvm/ADT/StringSwitch.h"
 #include "llvm/Support/Debug.h"
 #include <utility>
 #define DEBUG_TYPE "CXX.cpp"

 namespace clang {
 namespace pseudo {
 namespace cxx {
... 80 lines not shown ...
 // RHS is expected to contain a single symbol.
 // Returns the corresponding ForestNode.
 const ForestNode &onlySymbol(SymbolID Kind,
                              const ArrayRef<const ForestNode *> RHS,
                              const TokenStream &Tokens) {
   assert(RHS.size() == 1 && RHS.front()->symbol() == Kind);
   return *RHS.front();
 }
+enum IDKind : unsigned {
+  // Type
+  Class = 1 << 0,
+  Enum = 1 << 1,
+  Typedef = 1 << 2,
+
+  Namespace = 1 << 3,
+  NamespaceAlias = 1 << 4,
+
+  Function = 1 << 5,
+
+  Template = 1 << 6,
+  TemplateFunction = Template | Function,
+  TemplateClass = Template | Class,
+  TemplateTypedef = Template | Typedef,
+};
+inline IDKind operator|(IDKind L, IDKind R) {
+  return static_cast<IDKind>(static_cast<unsigned>(L) |
+                             static_cast<unsigned>(R));
+}
+inline IDKind operator&(IDKind A, IDKind B) {
+  return static_cast<IDKind>(static_cast<unsigned>(A) & static_cast<unsigned>(B));
+}
+
+llvm::StringMap<IDKind> *IDTable = []() {
+  auto *Results = new llvm::StringMap<IDKind>({
+      {"dyn_cast", IDKind::Function | IDKind::Template},
+
+  });
+  for (auto Namespace : {"llvm", "trace", "clangd", "clang", "std"})
+    Results->insert({Namespace, IDKind::Namespace});
+  for (auto Class :
+       {"SourceManager", "ReferenceLoc", "NamedDecl", "NamespaceDecl",
+        "NamedDecl", "ASTSignals", "ParsedAST", "ASTSignals", "Span",
+        "SymbolID", "string", "NamespaceDecl"})
+    Results->insert({Class, IDKind::Class});
+  for (auto Expression : {"isInsideMainFile", "findExplicitReferences",
+                          "getSymbolID", "printNamespaceScope"})
+    Results->insert({Expression, IDKind::Function});
+
+  return Results;
+}();

 bool isFunctionDeclarator(const ForestNode *Declarator) {
   assert(Declarator->symbol() == (SymbolID)(cxx::Symbol::declarator));
   bool IsFunction = false;
   using cxx::Rule;
   while (true) {
     // not well-formed code, return the best guess.
     if (Declarator->kind() != ForestNode::Sequence)
... 141 lines not shown ...
     switch (N->rule()) {
     default:
       LLVM_DEBUG(llvm::errs() << "Unhandled rule " << N->rule() << "\n");
       llvm_unreachable("hasExclusiveType be exhaustive!");
     }
   }
 }

 llvm::DenseMap<ExtensionID, RuleGuard> buildGuards() {
+#define HAS_BIT(value) \
+  [](const GuardParams &P) -> bool { \
+    const auto &T = P.Tokens.tokens()[P.RHS.front()->startTokenIndex()]; \
+    if (T.Kind != tok::identifier) \
+      return true; \
+    auto It = IDTable->find(T.text()); \
+    if (It == IDTable->end()) \
+      return true; \
+    return (value)&It->second; \
+  }
+#define EXACT_BIT(value) \
+  [](const GuardParams &P) -> bool { \
+    const auto &T = P.Tokens.tokens()[P.RHS.front()->startTokenIndex()]; \
+    if (T.Kind != tok::identifier) \
+      return true; \
+    auto It = IDTable->find(T.text()); \
+    if (It == IDTable->end()) \
+      return true; \
+    return (value) == It->second; \
+  }
+
 #define GUARD(cond) \
   { \
     [](const GuardParams &P) { return cond; } \
   }
 #define TOKEN_GUARD(kind, cond) \
   [](const GuardParams& P) { \
     const Token &Tok = onlyToken(tok::kind, P.RHS, P.Tokens); \
     return cond; \
   }
 #define SYMBOL_GUARD(kind, cond) \
   [](const GuardParams& P) { \
     const ForestNode &N = onlySymbol((SymbolID)Symbol::kind, P.RHS, P.Tokens); \
     return cond; \
   }

   return {
+      {(RuleID)Rule::class_name_0identifier, HAS_BIT(Class)},
+      {(RuleID)Rule::template_name_0identifier, HAS_BIT(Template)},
+      {(RuleID)Rule::enum_name_0identifier, HAS_BIT(Enum)},
+      {(RuleID)Rule::typedef_name_0identifier, HAS_BIT(Typedef)},
+      {(RuleID)Rule::namespace_name_0identifier, HAS_BIT(Namespace)},
+      {(RuleID)Rule::namespace_alias_0identifier, HAS_BIT(NamespaceAlias)},
+      // expression that refers to a function.
+      {(RuleID)Rule::id_expression_0unqualified_id, HAS_BIT(Function)},
+
+      {(RuleID)Rule::class_name_0simple_template_id, EXACT_BIT((IDKind::TemplateClass))},
+      {(RuleID)Rule::typedef_name_0simple_template_id, EXACT_BIT(IDKind::TemplateTypedef)},
+
       {(RuleID)Rule::function_declarator_0declarator,
        SYMBOL_GUARD(declarator, isFunctionDeclarator(&N))},
       {(RuleID)Rule::non_function_declarator_0declarator,
        SYMBOL_GUARD(declarator, !isFunctionDeclarator(&N))},

       // A {decl,type,defining-type}-specifier-sequence cannot have multiple
       // "exclusive" types (like class names): a value has only one type.
       {(RuleID)Rule::
... 132 lines not shown ...
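The two guard macros in this diff differ only in their final test: HAS_BIT is a mask test, EXACT_BIT is an equality test (both also return true for non-identifier tokens and for identifiers missing from the table). The sketch below isolates that difference. It reuses the bit values from the diff, but hasBit, exactBit and checkGuards are hypothetical names for illustration, not part of the patch.

```cpp
#include <cassert>

// Bit values copied from the diff's IDKind; unrelated kinds omitted.
enum IDKind : unsigned {
  Class = 1u << 0,
  Function = 1u << 5,
  Template = 1u << 6,
  TemplateFunction = Template | Function,
  TemplateClass = Template | Class,
};

// Mask test, as in HAS_BIT: any shared bit is enough.
constexpr bool hasBit(unsigned Wanted, unsigned Known) {
  return (Wanted & Known) != 0;
}
// Equality test, as in EXACT_BIT: the kinds must match completely.
constexpr bool exactBit(unsigned Wanted, unsigned Known) {
  return Wanted == Known;
}

// A class template passes both the class-name and template-name mask tests.
static_assert(hasBit(Class, TemplateClass), "mask test accepts TemplateClass");
static_assert(hasBit(Template, TemplateClass), "mask test accepts TemplateClass");
// A function template such as dyn_cast is rejected as a class-name
// simple-template-id, because its kind is not exactly TemplateClass.
static_assert(!exactBit(TemplateClass, TemplateFunction), "kinds differ");

inline void checkGuards() {
  assert(hasBit(Function, TemplateFunction));
  assert(!hasBit(Class, TemplateFunction));
  assert(exactBit(TemplateClass, Template | Class));
}
```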

clang-tools-extra/pseudo/lib/cxx/cxx.bnf

... 28 lines not shown ...
 # We list important nonterminals as start symbols, rather than doing it for all
 # nonterminals by default, this reduces the number of states by 30% and LRTable
 # actions by 16%.
 _ := translation-unit
 _ := statement-seq
 _ := declaration-seq

 # gram.key
-typedef-name := IDENTIFIER
-typedef-name := simple-template-id
-namespace-name := IDENTIFIER
-namespace-name := namespace-alias
-namespace-alias := IDENTIFIER
-class-name := IDENTIFIER
-class-name := simple-template-id
-enum-name := IDENTIFIER
-template-name := IDENTIFIER
+typedef-name := IDENTIFIER [guard]
+typedef-name := simple-template-id [guard]
+namespace-name := IDENTIFIER [guard]
+namespace-name := namespace-alias [guard]
+namespace-alias := IDENTIFIER [guard]
+class-name := IDENTIFIER [guard]
+class-name := simple-template-id [guard]
+enum-name := IDENTIFIER [guard]
+template-name := IDENTIFIER [guard]

 # gram.basic
 #! Custom modifications to eliminate optional declaration-seq
 translation-unit := declaration-seq
 translation-unit := global-module-fragment_opt module-declaration declaration-seq_opt private-module-fragment_opt

 # gram.expr
 # expr.prim
 primary-expression := literal
 primary-expression := THIS
 primary-expression := ( expression )
 primary-expression := id-expression
 primary-expression := lambda-expression
 primary-expression := fold-expression
 primary-expression := requires-expression
-id-expression := unqualified-id
+id-expression := unqualified-id [guard]
 id-expression := qualified-id
 unqualified-id := IDENTIFIER
 unqualified-id := operator-function-id
 unqualified-id := conversion-function-id
 unqualified-id := literal-operator-id
 unqualified-id := ~ type-name
 unqualified-id := ~ decltype-specifier
 unqualified-id := template-id
... 707 lines not shown ...