Our variable loading optimization uses some handmade lexer that currently doesn't
properly handle characters such as dollar-symbols or unicode characters. This patch
replaces our own lexer with the Clang lexer which properly handles all these corner cases.
I realized after revisiting this code that this can't handle Unicode and we probably should just use Clang's Lexer to get the tokens. Still, it would be nice if we can land this in case the Lexer rewrite takes longer than expected.
what if you have another variable in the same scope named just foo ?
Actually, I think the previous example should be fine, on a second thought. The example that I think might break is the one where you have an identifier in your source code named $R0. Is that allowed?
Is $R0 supposed to access a register? Or is that somehow a special syntax?
In any case the current code will prefer a local variable with from what I can see.
I thought about this a little more, and we discussed offline, but these are my two comments (we should understand what's the behaviour and add tests). We can address some of them as follow-up but I would like to understand what happens here.
- For C++, lldb names expression results using the convention $0, $1, etc... Can you have a variable named $0 in your source code? If so, what happens when you evaluate expr patatino? Will lldb be smart enough to understand that $0 is already defined and start with $1 ? Or it will override the scope?
- If you have $pat in your example, and you type expr int $pat = 234234, does that work?
Do we really expect users to actually write code where variables start with '$'? We will need to make some clear rules of the lookup order if so. I would suggest the following order
- look for real variables by exact name match first
- look for expression global variables next
- look for expression results next ($<digit>)
- look for registers
The idea was that a "$" prefix wouldn't be found in normal code. In LLDB $ has the following special use cases:
(lldb) expr int $my_global = 12; (lldb) expr ++$my_global
So the actual code that hit his use case was when one steps into the expression wrapper function we create in LLDB (which uses these dollar signs). So we (luckily) don't have any user code that triggers this.
I should note that I anyway discovered that the variable loading optimization that caused this regression doesn't support any unicode characters (not that we encourage that), so I'll anyway rewrite this parsing code with something that reuses the Clang lexer instead of our own ad-hoc parsing (which probably wil save us from more weird corner cases in the future).
- Rewrote the patch to use the Clang lexer, dropped our ad-hoc lexer.
- Make sure we only lex the expression once, not for each local variable.
- Added tests for unicode characters and a bunch of dollar-related corner cases. The current behavior seems reasonable and what people expect in the comments, so I just keep them around.
I would prefer if we could just consider this patch as a regression fix and not some official support for dollar-variables in user code (which seems like a mess to define properly). I added some tests that at least make sure the current behavior is tested, we can decide later if that's really what we want to do (but the current behavior seems quite reasonable and similar to what people suggest).