Page MenuHomePhabricator

Fix for PrintStackTraces
ClosedPublic

Authored by ravitheja on Jun 10 2016, 12:50 AM.

Details

Summary

The issue arises due to the wrong unwinder used for the first
stack frame, where the default unwinder returns erroneous frame
whereas the fallback would have given the correct frame had it
been used.
The following fix consists of two parts ->

  1. The first part changes the unwinding strategy, earlier the default unwinder was used to get 2 more stack frames and if it failed a fallback unwinder was used. Now we try to obtain as many frames (max 10) as we can get from default unwinder and also fallback unwinder and use the one that gives more number of frames.
  2. Normally unwindplans are assosciated with functions and the row to be used is obtained from the offset (obtained from the low_pc of the function symbol). Sometimes it may occur that the unwindplan is assosciated to the complete Elf section in which case the offset calculation would be wrong as the debugger uses the same offset originally obtained from the function symbol. Hence this offset is recalculated.

Diff Detail

Event Timeline

ravitheja updated this revision to Diff 60317.Jun 10 2016, 12:50 AM
ravitheja retitled this revision from to Fix for PrintStackTraces.
ravitheja updated this object.
labath edited edge metadata.Jun 10 2016, 10:57 AM

So, I'll let Jason be the definitive judge here, but I have to say I don't like this solution.

Computing 10 frames significantly increases the amount of processing needed to get some result. Most importantly, it increases the latency, as it will cause a lot more data to be pulled from the server in case of remote debugging and may force a lot more eh_frame parsing. Also, it makes the unwinder a lot more nondeterministic and harder to test as now the frame you get shown can depend on something happening 10 levels down.

I'd really prefer a solution where we can get default unwinder to do the right thing, so we could just avoid needing to rely on the fallback.

I agree with Pavel. Remember that most stops in the course of stepping will require an unwind (to tell step-in vrs. lateral step vrs. step out). This can happen many times in the course of a "user level" step. So for stepping, particularly on slow devices, unwinding is a bit part of the performance. We don't want to do any more unwinding than we have to.

Well its not possible for the assembly unwinder to do the correct thing in this situation , as the function is entered through a stub function (which has push instruction) and then a jump. The point of this patch is to catch when to use the CFI or the assembly unwinder.

ravitheja updated this revision to Diff 60533.Jun 13 2016, 8:34 AM
ravitheja edited edge metadata.

Making EH Frame CFI the full unwinder
when artificial symbols are found.

ravitheja updated this revision to Diff 60539.Jun 13 2016, 8:55 AM

Checks for nullptr

I think this idea in general is more acceptable, but we'll need an OK from Jason on the details.

Also, we should figure out a way to add a test for this. We should definitely add a more deterministic test and not rely on TestPrintStackTraces to check this functionality. Is it possible to write a self-contained test replicating these conditions, i.e. without depending on unwinding from standard library functions?

Well its not possible for the assembly unwinder to do the correct thing in this situation , as the function is entered through a stub function (which has push instruction) and then a jump. The point of this patch is to catch when to use the CFI or the assembly unwinder.

Note I said the "default" unwinder. If the assembly unwinder does not work, then maybe it should not be the default in this case.
Also, I seem to recall that the first plan being tried is the "augmented CFI", but that is very conservative in what it accepts as valid input. So maybe if we made it smarter (as in, recognize that in this case it does not need to do any augmentation and just pass through the eh_frame plan), then this and a lot of other problems would go away.

source/Plugins/Process/Utility/RegisterContextLLDB.cpp
1602 ↗(On Diff #60539)

What is this supposed to do?

jasonmolenda edited edge metadata.Jun 13 2016, 9:47 PM

Hi Ravi, sorry for not having time to look at this patch yet - I was out of the office Friday and was catching up on everything today. I'll look at this tomorrow. My initial reactions are to be a little worried. It sounds like you have functions that are non-ABI conforming -- which is fine for functions that don't have external linkage -- and you're tweaking the unwinder until it works for this case. When we were discussing this on lldb-dev a couple of weeks ago, the functions in question had no eh_frame code. Why are you forcing eh_frame for frame 0? Is this a different case than the one you were debugging ten days ago? One initial comment is I don't know about adding AlwaysRelyOnEHUnwindInfo to the DynamicLoader - the section of code you added this check to in RegisterContextLLDB is there because we had one or two solibs on darwin where we had to use eh_frame, hence it made sense to ask the DynamicLoader plugin. If we were to add something like AlwaysRelyOnEHUnwindInfo for an entire platform, I'd probably put it at FuncUnwinders::GetUnwindPlanAtNonCallSite. But I really don't think that's a good idea unless there's a concrete example where a non-ABI-conforming function actually has eh_frame that describes how to correctly unwind from it.

(my initial opinion is that a debugger trying to unwind out of non-ABI conforming functions without hand-written eh_frame describing how to do it, are in the "it would be nice if it worked" camp; I'd be reluctant to make changes that could compromise lldb otherwise to support this kind of scenario.)

But I really haven't given the patch or the discussion a fair read-through and thought about it more - I'll do that tomorrow.

ravitheja added inline comments.Jun 14 2016, 1:44 AM
source/Plugins/Process/Utility/RegisterContextLLDB.cpp
1602 ↗(On Diff #60539)

It stops the the TryFallbackUnwindplan to use the assembly unwinder.

ravitheja added a comment.EditedJun 14 2016, 1:59 AM

so regarding this particular situation I want to give little more insight ->
It starts out from here

0x40143a <+346>: movabsq $0x403e32, %rdi           ; imm = 0x403E32 
0x401444 <+356>: movb   $0x0, %al
0x401446 <+358>: callq  0x400d30                  ; symbol stub for: printf
0x40144b <+363>: movq   0x6071c0, %rdi
0x401453 <+371>: movl   %eax, -0xdc(%rbp)
0x401459 <+377>: callq  0x400ed0                  ; symbol stub for: fflush  (jump to next disassembly block 1)
0x40145e <+382>: movl   $0x40, %esi
0x401463 <+387>: leaq   -0xb0(%rbp), %rdi
0x40146a <+394>: movq   0x607158, %rdx
0x401472 <+402>: movl   %eax, -0xe0(%rbp)

(lldb) disassemble (disassembly block1)
a.out`fflush:

0x400ed0 <+0>:  jmpq   *0x206212(%rip)           ; _GLOBAL_OFFSET_TABLE_ + 232
0x400ed6 <+6>:  pushq  $0x1a
0x400edb <+11>: jmp    0x400d20                      ;  (jump to **disassembly block 2**)

(lldb) disassemble (disassembly block 2)

0x400d20:       pushq  0x2062e2(%rip)            ; _GLOBAL_OFFSET_TABLE_ + 8
0x400d26:       jmpq   *0x2062e4(%rip)           ; _GLOBAL_OFFSET_TABLE_ + 16   I think this jump goes to  0x7ffff7df04a0

ld-linux-x86-64.so.2`___lldb_unnamed_symbol95$$ld-linux-x86-64.so.2:

0x7ffff7df04a0 <+0>:  subq   $0x38, %rsp    -> The testcase tries to unwind out of here and fails.
0x7ffff7df04a4 <+4>:  movq   %rax, (%rsp)
0x7ffff7df04a8 <+8>:  movq   %rcx, 0x8(%rsp)
0x7ffff7df04ad <+13>: movq   %rdx, 0x10(%rsp)
0x7ffff7df04b2 <+18>: movq   %rsi, 0x18(%rsp)
0x7ffff7df04b7 <+23>: movq   %rdi, 0x20(%rsp)
0x7ffff7df04bc <+28>: movq   %r8, 0x28(%rsp)
0x7ffff7df04c1 <+33>: movq   %r9, 0x30(%rsp)
0x7ffff7df04c6 <+38>: movq   0x40(%rsp), %rsi

Now as you can see, from inside fflush its not possible for the assembly unwind to figure out the situation.
@jasonmolenda The functions I posted in the lldb-dev are the same, here i am just posting how it got there.
There is eh_frame information for these functions, that is able to correctly point out the CFA.

lldb) image show-unwind --address 0x7ffff7df04a0
UNWIND PLANS for ld-linux-x86-64.so.2`___lldb_unnamed_symbol95$$ld-linux-x86-64.so.2 (start addr 0x7ffff7df04a0)

Asynchronous (not restricted to call-sites) UnwindPlan is 'assembly insn profiling'
Synchronous (restricted to call-sites) UnwindPlan is 'eh_frame CFI'

Assembly language inspection UnwindPlan:
This UnwindPlan originally sourced from assembly insn profiling
This UnwindPlan is sourced from the compiler: no.
This UnwindPlan is valid at all instruction locations: yes.
Address range of this UnwindPlan: [ld-linux-x86-64.so.2..text + 88512-0x0000000000015a30)
row[0]: 0: CFA=rsp +8 => rsp=CFA+0 rip=[CFA-8]
row[1]: 4: CFA=rsp+64 => rsp=CFA+0 rip=[CFA-8]
row[2]: 94: CFA=rsp -8 => rsp=CFA+0 rip=[CFA-8]

eh_frame UnwindPlan:
This UnwindPlan originally sourced from eh_frame CFI
This UnwindPlan is sourced from the compiler: yes.
This UnwindPlan is valid at all instruction locations: no.
Address range of this UnwindPlan: [ld-linux-x86-64.so.2..text + 88512-0x0000000000015a21)
row[0]: 0: CFA=rsp+24 => rip=[CFA-8]
row[1]: 4: CFA=rsp+80 => rip=[CFA-8]
row[2]: 94: CFA=rsp +8 => rip=[CFA-8]

Arch default UnwindPlan:
This UnwindPlan originally sourced from x86_64 default unwind plan
This UnwindPlan is sourced from the compiler: no.
This UnwindPlan is valid at all instruction locations: no.
row[0]: 0: CFA=rbp+16 => rbp=[CFA-16] rsp=CFA+0 rip=[CFA-8]

Arch default at entry point UnwindPlan:
This UnwindPlan originally sourced from x86_64 at-func-entry default
This UnwindPlan is sourced from the compiler: no.
This UnwindPlan is valid at all instruction locations: not specified.
row[0]: 0: CFA=rsp +8 => rsp=CFA+0 rip=[CFA-8]

As you can see the eh_frame UnwindPlan is correct here.

@labath In order to reproduce this situation without the help of standard library, I would have to write handwritten assembly and the CFI directives for that, is that fine ?

@labath In order to reproduce this situation without the help of standard library, I would have to write handwritten assembly and the CFI directives for that, is that fine ?

Yes, I think that's fine. Obviously that will make the test x86-specific (and probably linux-specific, although it would be great if that can be avoided (*)), but at least it will be well focused, and not relying on random timings in other tests. A less preferred but-still-better-than-status-quo option would be to keep the standard library dependency but remove the timing issues (e.g., by setting the breakpoint in fflush, instruction-stepping 100 times, and making sure you unwind correctly from each place).

(*) One way to do that is to avoid running code. If you can write the test in a way that it does not need a running process then you can just check in a tiny (linux) module, load it and query some properties of the contained functions and their unwind plans...

Thanks Ravi, I see the problem here and I agree that lldb should use eh_frame to unwind from this function - that's the only way this is going to work. I'm surprised that there is accurate eh_frame instructions for this function, it's great to see it.

I'm wondering if we can come up with a way to detect this scenario - where we're unwinding from a function that doesn't conform to the ABI. I know it's a little special-case-y but what if we modified FuncUnwinders::GetUnwindPlanAtNonCallSite to:

Get the assembly unwind plan, eh_frame unwind plan, the ABI ArchitectureDefaultAtFunctionEntry unwind plan, and the ArchitectureDefault unwind plan.

If the eh_frame's row 0 (initial set of unwind instructions) has a pc/lr restore rule that doesn't match any of the other three unwind plans, then we assume that the eh_frame instructions have knowledge that the others do not have, and we use eh_frame for this function.

I'll write this up tomorrow and see if it works on mac (although we've almost completely stopped using eh_frame on our platform so I'm not a very good test bed at all) - I can imagine how the code would look and I think it's possible to do.

If eh_frame accurately describes the prologue instructions (I think I mentioned in the lldb-dev emails a couple weeks ago -- it is not required to describe the prologue), it the pc unwind rule will be the same as the ABI's ArchitectureDefaultAtFunctionEntry. Or it will be the same as the assembly unwind rule. If the eh_frame doesn't describe the prologue -- if the 0th row is the register state after the prologue has executed -- it will match the ABI ArchitectureDefault's unwind rule to restore the pc.

The main oddity here are architectures like arm where the restore rule for pc may be in terms of the pc, or it may be in terms of the link register ("return address register" in lldb's terminology). Different UnwindPlans may specify how to restore the pc register value, or how to restore the ra register value -- they're inconsistent -- and that needs to be taken into account.

I know this isn't the approach you took with the patch, but I think it would be an improvement to lldb (especially if systems are getting eh_frame like this) and it will solve the problem I believe. What do you think? As I said, I'm happy to write up what I'm suggesting here, I don't want to ask you to do extra work to try out this idea. I'm in and out of the office this week because of an apple conference but I'll try to get something together over the next few days.

As for writing test cases, these are tricky, and I need to write unwind test cases too, it's something I haven't had time to do yet for lldb and we really need these test cases, the unwinder is VERY hard to work on without a rich testsuite.

They end up being architecture (x86_64) and platform (mac) specific... you write a top-level driver in C which calls hand-written assembly functions (either inlined or a .s file) and the little details / ABI vary from platform to platform, I'm not sure if it's worth trying to make them portable - we can rely on buildbots to run them on the matrix of systems if nothing else. The only alternative are core files of processes in the correct stopped state - but then you have to store gigantic core files and possibly binaries along with them and I don't think we're going to head down that route any time soon.

Regarding the part of the patch where eh_frame is used unconditionally on Linux systems, I'm hesitant to make a change like that (beyond the notes I mentioned earlier about how I'd do it in FuncUnwinders instead of down in RegisterContextLLDB). Modern gcc's put out eh_frame with prologue and epilogue descriptions on x86_64 - but there's no guarantee that eh_frame we see on a linux or bsd system will come from a modern gcc. For instance, clang (last I checked) does not emit epilogue information in the eh_frame (as did gcc until a few years ago). gdb does pretty well relying exclusively on eh_frame, but the design of the lldb unwinder was to do better than gdb, by using multiple sources of unwind information.

Of all the unwind info sources, eh_frame is the most frustrating, IMO. It *can* be everything a debugger could ever need -- but it isn't guaranteed to be that. It's only guaranteed to be enough for exception handling, which is a very different set of requirements from what a debugger needs. And there's no reliable way for a debugger to tell how descriptive an eh_frame entry is.

ravitheja added a comment.EditedJun 15 2016, 7:45 AM

@jasonmolenda The approach suggested seems promising, I look forward to it, if u want I can test it at my end ? once I see it. But I must say this patch also fixes the row offset calculation, so the unwindplan specified in the eh_frame plan is specified for the complete PLT section and not w.r.t the function symbol, so the offset used for picking the row from the unwindplan should be calculated taking into consideration this point.

clayborg resigned from this revision.Jun 20 2016, 5:16 PM
clayborg removed a reviewer: clayborg.

Jason Molenda should be able to adequately review any patches. If he is OK, then I am OK.

ravitheja updated this revision to Diff 61668.Jun 23 2016, 5:48 AM
ravitheja edited edge metadata.

Adding testcase

ravitheja updated this revision to Diff 61936.Jun 27 2016, 12:36 AM

Renaming testcase

ravitheja updated this revision to Diff 62831.Jul 6 2016, 1:28 AM

Removing other files.

jasonmolenda accepted this revision.Jul 6 2016, 4:14 PM
jasonmolenda edited edge metadata.

Ravitheja and I had some discussions over email about a possible alternate approach to this issue - I've committed that approach as r274700 after testing help from Ravi. This patchset is tracking the very nice test case that was written to replicate the problem, thanks for doing that Ravitheja!

This revision is now accepted and ready to land.Jul 6 2016, 4:14 PM
ravitheja closed this revision.Jul 7 2016, 6:07 AM