This is an archive of the discontinued LLVM Phabricator instance.

[pseudo] Add recovery for declarations
Needs ReviewPublic

Authored by hokein on Jul 25 2022, 12:25 AM.

Details

Reviewers
sammccall
Summary
  • declaration recovery strategy: search for likely declaration boundaries
  • change decl-sequence to right-recursive to simplify this
  • also use for class members (already right-recursive)
  • refactor recovery strategy to use struct parameter and expose cursor position

Diff Detail

Event Timeline

sammccall created this revision.Jul 25 2022, 12:25 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 25 2022, 12:25 AM
sammccall requested review of this revision.Jul 25 2022, 12:25 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 25 2022, 12:25 AM

This may well need refinement & more testing, interested in initial thoughts

hokein added inline comments.Jul 28 2022, 6:37 AM
clang-tools-extra/pseudo/lib/cxx/CXX.cpp
426

I think this excludes the parameter-declaration (from the grammar spec, it is a different nonterminal).

426

The logic of implementations looks very solid. I think it would be good to have some unittests for it.

441

if I read the code correctly, the following case should work like

void foo() {}

LLVM_DISCARD int bar() {} // this is an opaque `declaration` node?

void foo2() {} // it won't work if we change void -> Foo.
450

the list looks reasonable -- I think we can lookup the Follow(declaration) set to find more.

460

maybe consider tok::identifier (if the identifier is a type specifier, it is a good indicator of a declaration starting) as well? And yeah, we're less certain about it, but since we use the indentation information, I guess it might probably work well?

511

I think we miss to check the return index >= Cursor here.

clang-tools-extra/pseudo/lib/cxx/cxx.bnf
308

Just want to spell out, this causes a performance hit :(
(a pure change of this has a ~6% slowdown, 7.17 MB/s -> 6.7MB/s on SemaCodeComplete.cpp).

left-cursive grammar is more friendly to LR-like parsers, because the parser can do the reduction earlier (after a parsed a declaration, we can perform a declaration-seq reduce); while right-cursive grammar, the reduction starts from the last declaration (this means our GSS is more deeper).

change decl-sequence to right-recursive to simplify this

I remembered we discussed this bit before, but I don't really remember the details now :(

clang-tools-extra/pseudo/test/cxx/recovery-declarations.cpp
7

nit: add some tests to cover the initializer, e.g.

auto lambda = [] xxxx () xxx {};
hokein commandeered this revision.Sep 21 2022, 6:36 AM
hokein edited reviewers, added: sammccall; removed: hokein.
hokein added inline comments.
clang-tools-extra/pseudo/lib/cxx/CXX.cpp
492

P.Tokens.tokens()[P.Begin].next() doesn't seem like correct, we should start with P.Tokens.tokens()[P.Begin], otherwise we will ignore the first token,

example:

namespace XXXX {
}
foo foo;
501

we should return index() +1

506

the same, +1.

I'm going to pick this up.

hokein updated this revision to Diff 462107.Sep 22 2022, 1:40 AM
  • rebase, fixed a few conflicts
  • address comments
  • fix some bugs
  • restructure the code, moving the implementation to a separate file
  • add unittests for recoveryNextDeclaration
  • keep the left-recursive declaration-seq grammar rule to void performance regression

I think the current version should be good to review, some bits need to take care of:

  • left-recursive vs right-recursive of declaration-seq rule, see my other comments about it. Currently I keep it as-is to avoid the performance regression on large files (I still don't see your point how right-recursive can simplify the error recovery, it seems to me either can work at least for this case)
  • recoveryNextDeclaration will consume at least 1 token (the return token index > Cursor) to avoid the risk of running in an infinite loop
  • I keep the original implementation as-is (except fixing some bugs), but IMO T->Indent == OrigIndent && LikelyStartsDeclaration(T->Kind) is too strict (see my FIXME testcases), and we should probably include the IDENTIFIER (I believe it is a common case), LikelyStartsDeclaration(T->Kind) || (T->Kind = IDENTIFIER && T->Indent == OrigIndent) might be better.
clang-tools-extra/pseudo/lib/cxx/CXX.cpp
450

added more: kw_friend, kw_inline, kw_explicit, kw_virtual.

clang-tools-extra/pseudo/lib/cxx/cxx.bnf
308

Got more data, it does hurt the performance on large files:

SemaExpr.cpp: 9.54MB/s -> 7.98MB/s
SemaDeclCXX.cpp: 10.16MB/s -> 9.25MB/s

small/medium files doesn't seem to be affected (hypothesis: these files has less declarations than large files, thus being less affected by the right-recursive declaration-seq grammar rule)

Hover.cpp: 8MB/s -> 7.92MB/s
ASTSignals.cpp: 12.34M/s -> 12.29MB/s

hokein updated this revision to Diff 462448.Sep 23 2022, 5:19 AM

Fix a bug caused by my change, and add test.