Page MenuHomePhabricator

[lldb] Fix crash due to unicode characters and dollars in variable names.

Authored by teemperor on Jul 3 2019, 11:19 PM.



Our variable loading optimization uses some handmade lexer that currently doesn't
properly handle characters such as dollar-symbols or unicode characters. This patch
replaces our own lexer with the Clang lexer which properly handles all these corner cases.

Diff Detail

Event Timeline

teemperor created this revision.Jul 3 2019, 11:19 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 3 2019, 11:19 PM

I realized after revisiting this code that this can't handle Unicode and we probably should just use Clang's Lexer to get the tokens. Still, it would be nice if we can land this in case the Lexer rewrite takes longer than expected.

davide requested changes to this revision.Jul 4 2019, 9:41 AM
davide added a subscriber: davide.
davide added inline comments.

what if you have another variable in the same scope named just foo ?
What expr $foo will print?
I would expect this to be recognized as expr $$foo.

This revision now requires changes to proceed.Jul 4 2019, 9:41 AM
davide added inline comments.Jul 4 2019, 9:44 AM

Actually, I think the previous example should be fine, on a second thought. The example that I think might break is the one where you have an identifier in your source code named $R0. Is that allowed?

teemperor marked an inline comment as done.Jul 4 2019, 4:00 PM
teemperor added inline comments.

Is $R0 supposed to access a register? Or is that somehow a special syntax?

In any case the current code will prefer a local variable with from what I can see.

I thought about this a little more, and we discussed offline, but these are my two comments (we should understand what's the behaviour and add tests). We can address some of them as follow-up but I would like to understand what happens here.

  1. For C++, lldb names expression results using the convention $0, $1, etc... Can you have a variable named $0 in your source code? If so, what happens when you evaluate expr patatino? Will lldb be smart enough to understand that $0 is already defined and start with $1 ? Or it will override the scope?
  2. If you have $pat in your example, and you type expr int $pat = 234234, does that work?

Do we really expect users to actually write code where variables start with '$'? We will need to make some clear rules of the lookup order if so. I would suggest the following order

  • look for real variables by exact name match first
  • look for expression global variables next
  • look for expression results next ($<digit>)
  • look for registers

The idea was that a "$" prefix wouldn't be found in normal code. In LLDB $ has the following special use cases:

  • all expression results are named $<digit> just like LLDB
  • if you use $ it can be followed by a register name so you can access registers in expressions ($rax, $sp, etc)
  • any definitions in your expression where the variable is prefixed with '$' will become an expression global variable available to all subsequent expressions:
(lldb) expr int $my_global = 12;
(lldb) expr ++$my_global
  • All function names that start with '$' are currently thought to be reserved for expression and so that any functions we create don't interact with real functions

So the actual code that hit his use case was when one steps into the expression wrapper function we create in LLDB (which uses these dollar signs). So we (luckily) don't have any user code that triggers this.

I should note that I anyway discovered that the variable loading optimization that caused this regression doesn't support any unicode characters (not that we encourage that), so I'll anyway rewrite this parsing code with something that reuses the Clang lexer instead of our own ad-hoc parsing (which probably wil save us from more weird corner cases in the future).

teemperor updated this revision to Diff 208569.Jul 8 2019, 6:44 PM
teemperor retitled this revision from [lldb] Fix crash due to dollar character in variable names. to [lldb] Fix crash due to unicode characters and dollars in variable names..
teemperor edited the summary of this revision. (Show Details)
  • Rewrote the patch to use the Clang lexer, dropped our ad-hoc lexer.
  • Make sure we only lex the expression once, not for each local variable.
  • Added tests for unicode characters and a bunch of dollar-related corner cases. The current behavior seems reasonable and what people expect in the comments, so I just keep them around.

I would prefer if we could just consider this patch as a regression fix and not some official support for dollar-variables in user code (which seems like a mess to define properly). I added some tests that at least make sure the current behavior is tested, we can decide later if that's really what we want to do (but the current behavior seems quite reasonable and similar to what people suggest).

JDevlieghere added inline comments.Jul 9 2019, 9:06 AM

Can we write this as while(!lex.LexFromRawLexer(token))?

teemperor marked an inline comment as done.Jul 10 2019, 11:00 AM
teemperor added inline comments.

LexFromRawLexer seems to return true when the current result is the last token (but still a valid token), so I think no :(

teemperor accepted this revision.Jul 11 2019, 5:12 PM

Closed by 365698

teemperor abandoned this revision.Jul 11 2019, 5:12 PM