This is an archive of the discontinued LLVM Phabricator instance.

[pseudo] Split greatergreater token.
ClosedPublic

Authored by hokein on Mar 15 2022, 2:15 AM.

Details

Summary

For a >> token (a right shift operator, or a nested template?), the clang
lexer always returns a single greatergreater token, as a result,
the grammar-based GLR parser never try to parse the nested template
case.

We derive a token stream by always splitting the >> token, so that the
GLR parser is able to pursue both options during parsing (usually 1
path fails).

Diff Detail

Event Timeline

hokein created this revision.Mar 15 2022, 2:15 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 15 2022, 2:15 AM
hokein requested review of this revision.Mar 15 2022, 2:15 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 15 2022, 2:15 AM
hokein updated this revision to Diff 415377.Mar 15 2022, 3:51 AM

define a greatergreater nonterminal per discussion.

sammccall accepted this revision.Mar 15 2022, 3:54 AM

This seems like the right approach to me.

It seems a bit strange to be doing this whether we're parsing C++ or not, but I guess we're not going to want low-level grammar differences between C and C++ if we can avoid it.

clang/lib/Tooling/Syntax/Pseudo/Token.cpp
97 ↗(On Diff #415363)

What do you think about combining this into cook()?
Seems a bit wasteful to do a separate pass for this. (We can always split it out again later if cook() gets too complicated)

107 ↗(On Diff #415363)

I don't think we'll ever fix this, I'd drop the FIXME and add a ! at the end :-)

112 ↗(On Diff #415363)

why not assert at the top and remove the if?

This can only be caused by a really bad programmer error, I don't think we should worry about defending against it making it into production.

clang/lib/Tooling/Syntax/Pseudo/cxx.bnf
102 ↗(On Diff #415363)

we should rather indirect this change through a new nonterminal:

greater-greater := > >
fold-operator := greater-greater

This lets us document the change in one place, and makes it easier to avoid the false parses if they prove important. We could restrict the first rule to only match if the second token has a mergeable bit or something.

This revision is now accepted and ready to land.Mar 15 2022, 3:54 AM
hokein updated this revision to Diff 415387.Mar 15 2022, 4:46 AM
hokein marked 4 inline comments as done.

address comments

Herald added a project: Restricted Project. · View Herald TranscriptMar 17 2022, 5:45 AM
This revision was landed with ongoing or failed builds.Mar 17 2022, 5:47 AM
This revision was automatically updated to reflect the committed changes.

Hello,

It seems like the new testcase fails when run with sanitizers:
https://lab.llvm.org/buildbot/#/builders/5/builds/20833

This fails on windows bot at https://logs.chromium.org/logs/chromium/buildbucket/cr-buildbucket/8819402935494502225/+/u/package_clang/stdout?format=raw.

C:\b\s\w\ir\cache\builder\src\third_party\llvm\clang-tools-extra\pseudo\unittests\TokenTest.cpp(190): error: Value of: Split.tokens()


 Expected: has 5 elements where


 element #0 token (">", 51),


 element #1 token (">", 51),


 element #2 token (">", 51),


 element #3 token (">", 51),


 element #4 token (">>=", 54)


   Actual: { greater 1:0 ">" flags=1, greater 1:0 ">" flags=1, greater 3:0 "\80" flags=1, greater 3:0 "\10" flags=1, greatergreaterequal 5:0 ">>=" flags=1 }, whose element #2 doesn't match


 [  FAILED  ] TokenTest.SplitGreaterGreater (1 ms)

We had the same thing happen downstream. One of our guys speculates that because some allocations are "owned" by the returned TokenStream, but the returned TokenStream is a temporary, the lifetimes might not be sufficient and some deallocations can happen.
If I change the nested calls stripComments(cook(lex(Code, Opts), Opts)) to separate calls stored to separate locals, the problem goes away.

Thanks for the report and revert, and sorry for the failures and the late response (was OOO on Friday).

One of our guys speculates that because some allocations are "owned" by the returned TokenStream, but the returned TokenStream is a temporary, the lifetimes might not be sufficient and some deallocations can happen.
If I change the nested calls stripComments(cook(lex(Code, Opts), Opts)) to separate calls stored to separate locals, the problem goes away.

Thanks for diagnosing it. Yeah, this is the root cause. cook returns a tokenStream (which has its own payload), and this token stream is a temporary and being destroyed after the stripComments statement, and we have dangling pointers. I will prepare a fix.