This adds stack scanning as fallback unwind plan to find a value that looks like
return address. If the return address is not valid, continue scanning from next
locatoin on stack when trying fallback unwind and the active row is raSearch.
Details
- Reviewers
labath clayborg jasonmolenda
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
It is unclear how this patch works. Was there support for searching for the return address on the stack already? And this patch is just adding the a few register unwind rules for when/if the return address _is_ found for x86?
This will need tests.
lldb/include/lldb/Symbol/UnwindPlan.h | ||
---|---|---|
225 | What is this magic number/bit here? Is it ok to clobber bit zero here? | |
263 | Are we assuming "search_offset" must be aligned to at least a 4 bit boundary so that we can put something in bit zero? | |
267 | You are using & and | in other locations, and now we are using modulo? I would just keep with the & operators | |
lldb/include/lldb/Target/ABI.h | ||
99 | Add a default implementation that returns false so you don't have to change all of the classes that don't implement this. | |
lldb/source/Plugins/ABI/AArch64/ABIMacOSX_arm64.h | ||
30–33 ↗ | (On Diff #424268) | Don't make the CreateStackWalkingUnwindPlan pure virtual if most of the implementations are just "return false;". I would make a default implementation in the base class that just returns false. |
lldb/source/Plugins/ABI/X86/ABIWindows_x86_64.cpp | ||
734 | What code actually does the search for a return address? Is that already available somewhere in the unwind plans? Reading through this plan I see that it sets the CFA to be RA search but I fail to see any searching going on. | |
lldb/source/Plugins/SymbolFile/NativePDB/PdbAstBuilder.cpp | ||
1258 ↗ | (On Diff #424268) | all changes in this file don't seem related to this patch? Revert? |
If we do find what looks like a return address, is there any validation done on the instruction before the return address to see if it is a function call instruction? That would be the best way to validate that something on the stack just doesn't look like a return address and is actually just a random function pointer. I would like to see validation on anything that looks like a return address like:
- make sure it doesn't point to the first address of a function since if it actually is a return address is should be in the middle of a function
- make sure the previous instruction was a function call to the function the current function from which we are unwinding if we have function bounds for the current frame
- if we don't have function bounds for the current function, we can benefit from knowing the function start address in the current frame if we can find the call instruction since that will help us parse the correct function prologue
This is obviously building on Pavel's work of https://reviews.llvm.org/D66638 for Win32 systems. What problem with that existing patchset is this addressing? Pavel's original patch assumes that we can retrieve a function's expected stack frame size, and the size of the caller's arguments on the stack, plus a bit of searching around that location, as he describes. It seems like in ABIWindows_x86_64::CreateStackWalkingUnwindPlan you're adding a scheme that tries to look at the area directly around $sp for a return pc? This seems quite fragile by itself.
The idea of a stack walk based purely on looking at stack contents, outside of the Windows ABI, is interesting, but as Pavel mentions in his descriptions, you have to be careful to not get tricked by function pointers on the stack (for return-pc's) and pointers to stack memory (for spilled fp or sp values). If we have an ABI that always writes fp+return-pc to stack at the start of the stack frame for non-leaf frames, we could maybe do something as a last resort, I haven't thought about it.
Greg's feedback is all solid. I'm more curious about what higher level fix is being pursued with this patch. Pavel is clearly the person who knows the most about this part of the unwinder.
I guess the other comment I would make is whether we should have a new ivar, instead of using the unused 0th bit of search_offset.
fwiw I wasn't real clear in my previous comment. As you say in the title of this, you're trying to backtrace through a case where you have no symbol file, and I'm guessing the Windows ABI doesn't use a frame pointer register (rbp) that is is spilled to stack at the start of a stack frame by convention. The ArchitecturalDefaultUnwindPlan's that we use on ABIs where you see saved-pc followed by saved-fp can do a stack walk following the chain of those two spilled registers pretty reliably, once they're off of the leaf frame. This isn't the case on Windows right. You have nothing but a saved return address, and the stack pointer is not spilled unless there's some kind of variable length stack frame I'm guessing.
Thanks for the feedback. As you pointed out that stack scanning without symbol file info is fragile due to function pointers in stack and we don't know the parameter size pushed into the stack.
Originally, I want lldb to do continue unwinding when it encounters some frames are in modules that don't have symbol info and no rbp value, since we have some crash reports with such problem. But later I found that we actually do have the symbol files for those modules. It seems not necessary anymore...
The patch description could definitely use more details about the motivation for the change, and a description of how it works. Apart from the fact that it uses the raSearch feature I introduced a while back, I don't know much about how it works (and given the planned changes, I am not going to try to understand that), but I think I can provide some details/thoughts about the motivation.
(IIUC), the motivating case is some code in kernel32.dll, which ends up setting rbp to zero. I don't know why it does that, but I suspect it may have something to do with the windows syscall convention (I know that ebp is used in i386 linux, for instance). While it would be definitely interesting to understand why it does that, I don't think it really matters -- one can never really exclude the possibility that some code (inline asm perhaps) will be messing with the frame pointer. So, scanning the stack for return addresses as a *last* last resort (the architectural default plan is already a kind of a last resort option) seems like it could be useful, regardless of the architecture (although the actual algorithm may differ somewhat for x86, which stores the return address on the stack directly, and say arm, which uses a link register). And given that the choice is between "showing nothing" and "showing something potentially incorrect", I don't think we need to be extra careful about vetting this address. All the checks Greg mentioned seem fine, but I don't see a compelling reason to implement them (all of them), especially given that they would not be useful in this case -- all we have is a minidump core file with some stack memory and a list of loaded modules (not necessarily the modules themselves). In this case, all we can do is to check whether the pointer points to some known code region (which is already done IIRC).
We wrote a command in python called "sbt" and we run this when stacks are truncated when loading minidumps. It often gives us pretty bad results and you would be surprised how many things on the stack look like code pointers. So if we do have the ability to do the checks when we do have the code and can try to verify that the thing on the stack is a return address, then it would be great to do so to avoid the noise. And if we are in code that has no symbols and we don't know the start address of the function, it would be VERY helpful to correctly backtrace if we _can_ determine the start address of the current frame's function if we don't know it as then we can correctly parse the prologue. So adding these checks might actually get us back on track and allow us not to have to use the RA search. The flow would be:
- enable RA search in a frame that has no function bounds info from the object files or runtime info
- find the right thing on the stack that is the correct RA by finding the instruction before the RA and figuring out what address it calls
- use the new function address to create a real stack frame that uses opcode parsing
So there is real benefit to doing these checks IMHO as it can help us get back on track in live debug sessions. Core file sessions results will vary depending on it we actually have the original object files or not, many times we do not have this info, in which case we would just enable RA search and we might be able to unwind the PC, but not other registers if we can't actually find the prologue or if we don't have stack frame details from breakpad
"sbt" == stack back trace
Just looks at the stack for things that look like code and tries to make something that looks like a backtrace
I also wonder if this shouldn't require a separate flag to turn on this search, or be a separate command in the way Greg did. People rely on backtraces being accurate. It's fine to add something more like a desperation play - something is sometimes better than nothing, but if we are presenting a view that we don't have full confidence in, we really should tell users that so they can factor it into their analysis.
I'm not saying we can't do that, but I don't think it has done be done straight in the first version. A lot of these checks are really just heuristics and can have false positives. For example, if we disqualify return addresses pointing to the first instruction of a function, then these cases won't work:
- noreturn functions (the call could be the last instruction in the previous function)
- signal handler frames, where the handler's return address points to the first instruction of the trampoline (because it didn't really call it, but it still knows how to unwind from it)
I think it would be better discuss individual checks independently of the overall infrastructure (and ideally with more data).
That said, it would definitely be nice give some indication of how much do we trust the backtrace that we've computed. A separate command or a setting kind of achieve that, but they're very coarse-grained (you don't know how much trust to put into a particular backtrace), and they need the user to know of their existence. I am wondering whether attaching some kind of flag to a particular frame (similar to how we add [opt] for optimized functions) to indicate that it may not be entirely correct could be a better option.
Update: address some inline comments.
For testing, I don't know how to create minidump in yaml format like the one in lldb/test/Shell/SymbolFile/Breakpad/Inputs/unwind-via-stack-win.yaml. Anyone knows how? Or have other ideas for testing?
I am wondering whether attaching some kind of flag to a particular frame (similar to how we add [opt] for optimized functions) to indicate that it may not be entirely correct could be a better option.
I'm lean towards this idea that adding a flag to the frame unwinded from stack scanning, maybe all frames after that one should have the flag.
lldb/include/lldb/Symbol/UnwindPlan.h | ||
---|---|---|
225 | When the last bit isn't set, it's the first ra search. The reason why we want to know this info is because after we got cfa via stack scanning in TryFallbackUnwindPlan when initializing zeroth/nonzeroth frames in RegisterContextUnwind, we will use that cfa for unwind next frame. If the next frame is valid, we are done. Otherwise, we get the incorrect cfa for the parent frame. Then we need to redo the stack scanning to find parent frame's cfa (UnwindLLDB::GetOneMoreFrame) but should skip the region that we have already scanned. So, the search_offset tells how many bytes should we skip. | |
263 | search_offset is usually the multiple of 4 or 8, so we can use the last bit. | |
lldb/source/Plugins/ABI/X86/ABIWindows_x86_64.cpp | ||
734 | This sets the unwind plan row to be ra search. The actual ra search happens on RegisterContextUnwind::ReadFrameAddress at the case UnwindPlan::Row::FAValue::isRaSearch. |
Use extra bool variable for is_first_search, because search_offset might be an odd number if parameter byte size in stack is an odd number.
lldb/include/lldb/Symbol/UnwindPlan.h | ||
---|---|---|
263 | Updated: |
What is this magic number/bit here? Is it ok to clobber bit zero here?