This is an archive of the discontinued LLVM Phabricator instance.

Support __kernel_rt_sigreturn in frame initialization
ClosedPublic

Authored by JosephTremoulet on Jun 21 2019, 1:23 PM.

Details

Summary

Add __kernel_rt_sigreturn to the list of trap handlers for Linux (it's
used as such on aarch64 at least).

Skip decrement-and-recompute for trap handlers in
InitializeNonZerothFrame, as signal dispatch may point the child frame's
return address to the start of the return trampoline.

Diff Detail

Repository
rL LLVM

Event Timeline

Herald added a project: Restricted Project. · View Herald Transcript

FYI, I sent mail about this to lldb-dev.. I'll copy the contents here for the benefit of anybody who didn't see it there but could use the context:

Hi,

I'm trying to use lldb in a project where I need to report stack traces from signal handlers, and need to do so on aarch64 Linux. Even for "synchronous" signal handling, I'm hitting a couple issues preventing this from working "out of the box":
1 - The platform signal handler trampoline function on the stack is __kernel_rt_sigreturn, not sigtramp. I'm wondering if it's ok to just add __kernel_rt_sigreturn to the list of trap handler symbol names in PlatformLinux, or if I need to include it in "UserSpecifiedTrapHandlerFunctionNames", or something else
2 - When the user handler is invoked, its return address is set to the very first byte of __kernel_rt_sigreturn, which throws off unwinding because we assume that frame must really be at a call in the preceding function. I asked about this on IRC, where Jan Kratochvil mentioned that the decrement shouldn't happen for frames with S in the eh_frame's augmentation. I've verified that __kernel_rt_sigreturn indeed has the S. I'm not sure where I'd find official documentation about that, but the DWARF Standards Committee's wiki[1] does link to Ian Lance Taylor's blog[2] which says "The character ‘S’ in the augmentation string means that this CIE represents a stack frame for the invocation of a signal handler. When unwinding the stack, signal stack frames are handled slightly differently: the instruction pointer is assumed to be before the next instruction to execute rather than after it." So I'm interested in encoding that knowledge in LLDB, but not sure architecturally whether it would be more appropriate to dig into the eh_frame record earlier, or to just have this be a property of symbols flagged as trap handlers, or something else.

I'd very much appreciate any feedback on this. I've put up a patch[3] on Phab with a testcase that demonstrates the issue (on aarch64 linux) and an implementation of the low-churn "communicate this in the trap handler symbol list" approach.

Thanks,
-Joseph

[1] - http://wiki.dwarfstd.org/index.php?title=Exception_Handling
[2] - https://www.airs.com/blog/archives/460
[3] - https://reviews.llvm.org/D63667

This looks good. Over in RegisterContextLLDB::GetFullUnwindPlanForFrame there's a behaves_like_zeroth_frame bool with similar setup -- in that case, it's trying to decide what type of UnwindPlan to pick for the frame, either one that is synchronous ("valid only at call sites / throwable locations") or asynchronous ("valid at every instruction"). For a behaves_like_zeroth_frame it needs to pick an asynchronous unwind plan. You may want to look at that too.

I didn't know about the S augmentation letter in eh_frame. See DWARFCallFrameInfo::ParseCIE / DWARFCallFrameInfo::FDEToUnwindPlan, that's where those are parsed. Each function with eh_frame has an FDE, and the FDE will refer to a CIE with basic setup information. Usually multiple FDE entries point to a single CIE, with the CIE specifying the things that are common among them all (usually things like the initial unwind state on function entry and such). We could add the flag to the UnwindPlan object. This would basically be telling us the same thing that IsTrapHandlerSymbol() is telling us, right, but I'm not sure the order of operations would make it easy to use - we pick out an unwind plan (which might be eh_frame based) too late, if I remember the codeflow right. It'd be a good change to capture that information from the eh_frame though, even if we don't see a clear way to use it right now.

We're going to pick up a FuncUnwinders object (an object with all the unwind plans for a given function) pretty early I bet, but we don't want to force a parse of eh_frame when we're creating that object if we can avoid it - parsing the eh_frame section of a binary the first time (and creating an index into it) is relatively slow, so we try to put it off if we can.

fwiw, and I think you've got this figured out already, with asynchronous signals we want to say (1) don't decrement the pc value when doing symbol lookups, (2) all registers can be retrieved (normally only callee-spilled aka non-volatile registers can be retrieved mid-stack), (3) we need to pick an asynchronous unwind plan because we may not be at a throwable / function call location like we normally are mid-stack.

  • Fix typos
  • Convey 'S' eh_frame augmentation to UnwindPlan

I've updated this with code to recognize the 'S' in the eh_frame augmentation and record it in a new bool member of UnwindPlan, m_plan_is_for_signal_trap. I haven't hooked up any consumers of the new bit; as you say, with the current code flow we don't parse the eh_frame info until after we've decided whether the current frame is a trap handler frame or not.

I took a look at behaves_like_zeroth_frame. In my case, we're not getting to the point of consulting it, because before checking it, if we are processing a trap handler frame or if AlwaysRelyOnEHInfo returns true (both of which are true in my case), we use the EH Frame unwind plan if it is valid at the current address, and that valid at current address check is already passing. I'm somewhat reluctant to fiddle with the behaves_like_zeroth_frame bit speculatively. Note that with this change, it will get set for the next frame above __kernel_rt_sigreturn (because the current test checks if the next-lower frame is an eTrapHandler frame), which I think is where we want the async processing.

I see that we also do this "subtract 1 from the pc" thing in StackFrame::GetSymbolContext, and I believe this is why even with my change here I'm seeing the wrong function name in backtraces (e.g., in the new test, it shows "abort_caller -> [function that happened to precede __kernel_rt_sigreturn] -> handler" rather than "abort_caller -> kernel_rt_sigreturn -> handler"). I'm not sure how best to address that, since the base StackFrame and RegisterContext classes don't have this notion of "frame type" / "trap handler frame" that the "-LLDB" subclasses have -- seems like we'd need to have GetFrameInfoAtIndex have another out parameter to specify frame type, and UnwindLLDB's implementation could then pull it from the RegisterContextLLDB -- would that be a good change? I don't actually need that for my current use case, the "wrong" name in the backtrace is fine...

Thanks for the notes on the asynchronous cases. I'd certainly be interested in getting those to work, though I haven't sorted out from where the unwinder can find the above-trap frame's register values (happy for pointers here!). I think I'm running into what this comment from GetFullUnwindPlanForFrame describes:

// If we're in _sigtramp(), unwinding past this frame requires special
// knowledge.  On Mac OS X this knowledge is properly encoded in the eh_frame
// section, so prefer that if available. On other platforms we may need to
// provide a platform-specific UnwindPlan which encodes the details of how to
// unwind out of sigtramp.

I'd like to get the synchronous case support working as a first step regardless.

  • fix copy pasta

Shouldn't you add also symbol __restore_rt (Fedora 30 x86_64)?

(gdb) bt
#0  handler (sig=6) at sigtest2.c:7
#1  <signal handler called>
#2  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#3  0x00007ffff7dea895 in __GI_abort () at abort.c:79
#4  0x000000000040118e in abort_caller () at sigtest2.c:12
#5  0x00000000004011c2 in main () at sigtest2.c:23
(gdb) frame 1
#1  <signal handler called>
(gdb) disas
Dump of assembler code for function __restore_rt:
=> 0x00007ffff7dfff00 <+0>:	mov    $0xf,%rax
   0x00007ffff7dfff07 <+7>:	syscall 
   0x00007ffff7dfff09 <+9>:	nopl   0x0(%rax)
End of assembler dump.
  • Include __restore_rt

Shouldn't you add also symbol __restore_rt (Fedora 30 x86_64)?

Thanks, yes, looks like I should. Updated.

When I compare the backtrace from GDB and LLDB (on x86_64, I haven't tried aarch64 as x86_64 is much faster=easier to handle):

    frame #2: 0x0000000000401195 sigtest2`handler(sig=6) at sigtest2.c:9:5
#2  0x0000000000401195 in handler (sig=6) at sigtest2.c:9

    frame #3: 0x00007f5fd888ef00 libc.so.6`.annobin_sigaction.c + 16
#3  <signal handler called> => 0x00007f5fd888ef00 <+0>: mov    $0xf,%rax

    frame #4: 0x00007f5fd888ee75 libc.so.6`__GI_raise at internal-signals.h:84:10
#4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 $1 = 0x7f5fd888ee75

    frame #5: 0x00007f5fd888ee5e libc.so.6`__GI_raise(sig=2) at raise.c:48
Missing in GDB

    frame #6: 0x00007f5fd8879895 libc.so.6`__GI_abort at abort.c:79:7
#5  0x00007f5fd8879895 in __GI_abort () at abort.c:79

That LLDB frame #5 is a bit bogus:

(lldb) down
frame #5: 0x00007f5fd888ee5e libc.so.6`__GI_raise(sig=2) at raise.c:48
   45  	
   46  	  int ret = INLINE_SYSCALL (tgkill, 3, pid, tid, sig);
   47  	
-> 48  	  __libc_signal_restore_set (&set);
   49  	
   50  	  return ret;
   51  	}
(lldb) disas
libc.so.6`__GI_raise:
...
    0x7f5fd888ee5b <+299>: movl   %eax, %r8d
    0x7f5fd888ee5e <+302>: movl   $0x8, %r10d
    0x7f5fd888ee64 <+308>: xorl   %edx, %edx
    0x7f5fd888ee66 <+310>: movq   %r9, %rsi
    0x7f5fd888ee69 <+313>: movl   $0x2, %edi
    0x7f5fd888ee6e <+318>: movl   $0xe, %eax
    0x7f5fd888ee73 <+323>: syscall 
->  0x7f5fd888ee75 <+325>: movq   0x108(%rsp), %rax
    0x7f5fd888ee7d <+333>: xorq   %fs:0x28, %rax

That 0x00007f5fd888ee5e is not even shown by disas, also it is no point of a return from any call.

That LLDB frame #5 is a bit bogus

Thanks for taking a look at this. I'm trying to reproduce what you're seeing and failing. This is what I see, in a fedora:latest docker container on an Ubuntu host on an x86_64 machine, with my patch applied to lldb:

$ gcc -g -O0 -o sigtest main.c
$ lldb ./sigtest
(lldb) target create "./sigtest"
Current executable set to './sigtest' (x86_64).
(lldb) breakpoint set -f main.c -l 6
Breakpoint 1: where = sigtest`handler + 11 at main.c:7:5, address = 0x0000000000401171
(lldb) run
Process 149148 launched: '/home/josepht/scratch/sigtest' (x86_64)
Process 149148 stopped
* thread #1, name = 'sigtest', stop reason = signal SIGABRT
    frame #0: 0x00007ffff7e38e75 libc.so.6`__GI_raise + 325
libc.so.6`__GI_raise:
->  0x7ffff7e38e75 <+325>: movq   0x108(%rsp), %rax
    0x7ffff7e38e7d <+333>: xorq   %fs:0x28, %rax
    0x7ffff7e38e86 <+342>: jne    0x7ffff7e38eac            ; <+380>
    0x7ffff7e38e88 <+344>: movl   %r8d, %eax
(lldb) continue
Process 149148 resuming
Process 149148 stopped
* thread #1, name = 'sigtest', stop reason = breakpoint 1.1
    frame #0: 0x0000000000401171 sigtest`handler(sig=6) at main.c:7:5
   4   
   5    void handler(int sig)
   6    {
-> 7        printf("Set a breakpoint here.\n");
   8        exit(0);
   9    }
   10  
(lldb) bt
* thread #1, name = 'sigtest', stop reason = breakpoint 1.1
  * frame #0: 0x0000000000401171 sigtest`handler(sig=6) at main.c:7:5
    frame #1: 0x00007ffff7e38f00 libc.so.6`.annobin_sigaction.c + 16
    frame #2: 0x00007ffff7e38e75 libc.so.6`__GI_raise + 325
    frame #3: 0x00007ffff7e23895 libc.so.6`__GI_abort + 295
    frame #4: 0x000000000040118e sigtest`abort_caller at main.c:12:5
    frame #5: 0x00000000004011c2 sigtest`main at main.c:23:5
    frame #6: 0x00007ffff7e24f33 libc.so.6`__libc_start_main + 243
    frame #7: 0x00000000004010ae sigtest`_start + 46
(lldb) frame select 2
frame #2: 0x00007ffff7e38e75 libc.so.6`__GI_raise + 325
libc.so.6`__GI_raise:
->  0x7ffff7e38e75 <+325>: movq   0x108(%rsp), %rax
    0x7ffff7e38e7d <+333>: xorq   %fs:0x28, %rax
    0x7ffff7e38e86 <+342>: jne    0x7ffff7e38eac            ; <+380>
    0x7ffff7e38e88 <+344>: movl   %r8d, %eax
(lldb) frame select 3
frame #3: 0x00007ffff7e23895 libc.so.6`__GI_abort + 295
libc.so.6`__GI_abort:
->  0x7ffff7e23895 <+295>: movq   %fs:0x10, %rdx
    0x7ffff7e2389e <+304>: cmpq   %rdx, 0x19f493(%rip)      ; lock + 8
    0x7ffff7e238a5 <+311>: je     0x7ffff7e238ed            ; <+383>
    0x7ffff7e238a7 <+313>: movl   $0x1, %esi

My frame 2 seems to correspond to your frame 4, and then my frame 3 seems to correspond to your frame 6, without the bogus frame in between. Do I have the wrong repro steps or something?

Thanks,
-Joseph

ping @jasonmolenda -- do the updates look like what you had in mind when you said " It'd be a good change to capture that information from the eh_frame though, even if we don't see a clear way to use it right now"?

Also /cc @clayborg (thanks for the reply on the Dev alias!). I think your suggestion to communicate the info to a new behaves_like_zero flag on lldb_private::StackFrame (presumably communicated via GetFrameInfoAtIndex) would help for then fixing StackFrame::GetSymbolContext to find the right symbol for these frames. I'm inclined to think that would make the most sense as a follow-up change, but lmk if you have thoughts.

I'm also curious what else needs to happen to be ready to merge this. I've tested on Ubuntu and Darwin and now Fedora, but not sure what other platforms might have similar holes that this test would then immediately fail on. If I add the changes to connect the 'S' in the augmentation all the way to StackFrame::GetSymbolContext, then at least I could make sure the test log in the error case would include the name of the function missing from the trap handler symbol list, but I don't know if that's sufficiently motivating to roll that into this PR or if it's preferable to keep it small... happy for feedback.

I've updated this with code to recognize the 'S' in the eh_frame augmentation and record it in a new bool member of UnwindPlan, m_plan_is_for_signal_trap. I haven't hooked up any consumers of the new bit; as you say, with the current code flow we don't parse the eh_frame info until after we've decided whether the current frame is a trap handler frame or not.

I took a look at behaves_like_zeroth_frame. In my case, we're not getting to the point of consulting it, because before checking it, if we are processing a trap handler frame or if AlwaysRelyOnEHInfo returns true (both of which are true in my case), we use the EH Frame unwind plan if it is valid at the current address, and that valid at current address check is already passing. I'm somewhat reluctant to fiddle with the behaves_like_zeroth_frame bit speculatively. Note that with this change, it will get set for the next frame above __kernel_rt_sigreturn (because the current test checks if the next-lower frame is an eTrapHandler frame), which I think is where we want the async processing.

Just just to check, we've got a backtrace like

#0 trap_handler_function_in_my_program
#1 __kernel_rt_sigreturn
#2 function_that_was_interrupted_asynchonrously
#3 main

The eh_frame for #1 has the S bit, and we've hard-coded the name of the function as a known asynchronous signal handler. When we're choosing an unwind plan for frame #2, or symbolicating the address, we need to treat this as a "behaves like zeroth frame" function - it could be on the first instruction, so backing the pc up by one would put us in the wrong function scope.

I see that we also do this "subtract 1 from the pc" thing in StackFrame::GetSymbolContext, and I believe this is why even with my change here I'm seeing the wrong function name in backtraces (e.g., in the new test, it shows "abort_caller -> [function that happened to precede __kernel_rt_sigreturn] -> handler" rather than "abort_caller -> kernel_rt_sigreturn -> handler"). I'm not sure how best to address that, since the base StackFrame and RegisterContext classes don't have this notion of "frame type" / "trap handler frame" that the "-LLDB" subclasses have -- seems like we'd need to have GetFrameInfoAtIndex have another out parameter to specify frame type, and UnwindLLDB's implementation could then pull it from the RegisterContextLLDB -- would that be a good change? I don't actually need that for my current use case, the "wrong" name in the backtrace is fine...

I think Unwind::GetFrameInfoAtIndex can take a new output argument, bool behaves_like_zeroth_frame. It would be set to true if idx == 0 or if m_frames[idx - 1].IsTrapHandlerFrame() is true.

StackFrame.h would pick up a new ivar m_behaves_like_zeroth_frame and accessor method BehavesLikeZerothFrame().

StackFrameList::GetFramesUpTo will fetch the value via unwinder->GetFrameInfoAtIndex() and pass it as an argument to the StackFrame::StackFrame ctor.

Then in StackFrame::GetSymbolContext we can use the m_behaves_like_zeroth_frame to fix the symbol lookup logic.

I'm not wedded to the behaves_like_zeroth_frame nomenclature. But I think this approach makes sense.

Thanks for the notes on the asynchronous cases. I'd certainly be interested in getting those to work, though I haven't sorted out from where the unwinder can find the above-trap frame's register values (happy for pointers here!). I think I'm running into what this comment from GetFullUnwindPlanForFrame describes:

// If we're in _sigtramp(), unwinding past this frame requires special
// knowledge.  On Mac OS X this knowledge is properly encoded in the eh_frame
// section, so prefer that if available. On other platforms we may need to
// provide a platform-specific UnwindPlan which encodes the details of how to
// unwind out of sigtramp.

I'd like to get the synchronous case support working as a first step regardless.

Ah, regarding the S bit, looks good to me. I think I misunderstood the original problem you were working on. On Darwin systems, we have our friend sigtramp. When an async interrupt is received, the kernel saves the register context to stack and then calls sigtramp which does some instructions and then calls the trap handler function. Backing up the pc by 1 for __sigtramp is fine, because it did a normal CALL instruction to the trap handler.

I'm guessing on this Linux system, you've got a trap receiver function on the stack that is on its first instruction or something? So backing up the pc value for *that* frame is the problem you're solving.

Then there's the issue of backing up the pc by 1 for the frame that was asynchronously interrupted - a separate issue altogether.

Anyway this looks fine to me.

I'm guessing on this Linux system, you've got a trap receiver function on the stack that is on its first instruction or something? So backing up the pc value for *that* frame is the problem you're solving.
...
Just just to check, we've got a backtrace like...

Yes, it's just like you describe.

I think Unwind::GetFrameInfoAtIndex can take a new output argument, bool behaves_like_zeroth_frame. It would be set to true if idx == 0 or if m_frames[idx - 1].IsTrapHandlerFrame() is true.
StackFrame.h would pick up a new ivar m_behaves_like_zeroth_frame and accessor method BehavesLikeZerothFrame().
StackFrameList::GetFramesUpTo will fetch the value via unwinder->GetFrameInfoAtIndex() and pass it as an argument to the StackFrame::StackFrame ctor.
Then in StackFrame::GetSymbolContext we can use the m_behaves_like_zeroth_frame to fix the symbol lookup logic.
I'm not wedded to the behaves_like_zeroth_frame nomenclature. But I think this approach makes sense.

Makes sense to me too, thanks for the pointers. I think I'll upload that as a separate patch to keep the reviews incremental, but maybe commit them back-to-back since the second would improve any diagnostics if the first fails on other platforms.

Anyway this looks fine to me.

Is that a green light to commit the change (modulo figuring out the issue that Jan reported which I'm not able to reproduce), or does that mean LGTY but I should get sign-off from someone else as well? (Sorry if I'm being dense, just haven't committed changes to lldb before and want to make sure I understand the workflow, and noticed you didn't mark the Phab review as "accepted", wondering if that's significant)

Thanks!

jasonmolenda accepted this revision.Jul 16 2019, 2:18 PM

Sorry yes, I should have marked this as approved. I don't have a linux machine handy so I haven't looked into the problem Jan reports with the change -- fwiw if you turn on unwind logging ('log enable lldb unwind' before you backtrace at a stop), it will tell you which unwind rule it is using to find the caller's return pc value as it goes up the stack. It's a little tricky to read at first, but once you get the hang of it you can usually spot the problem quickly and understand why lldb is behaving the way it is. But just looking at the source changes you're making, I'm fine with this.

This revision is now accepted and ready to land.Jul 16 2019, 2:18 PM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptJul 19 2019, 7:09 AM

without the bogus frame in between. Do I have the wrong repro steps or something?

You did not have DWARF for glibc installed: dnf debuginfo-install glibc

Then in fact your test was better as only real frames matter for this patch. There is some LLDB bug/incompatibility decoding inline function from (GCC-produced) DWARF but that is unrelated to this patch.

You did not have DWARF for glibc installed: dnf debuginfo-install glibc

Then in fact your test was better as only real frames matter for this patch. There is some LLDB bug/incompatibility decoding inline function from (GCC-produced) DWARF but that is unrelated to this patch.

Ah, that makes sense. Thanks for following up.