This is an archive of the discontinued LLVM Phabricator instance.

Fix the bug of garbage collection of siod.
ClosedPublic

Authored by LuoYuanke on Mar 30 2019, 6:36 PM.

Details

Summary

In the siod gc, it depends on setjmp(...) to get the value of callee saved register value, and traverse those register to get the possible local object pointer. Also it traverse current stack to get the possible local object pointer. For setjmp(...) on X86-64, the rbp register (callee saved register) is NOT saved in the setjmp buffer, so object that pointed by rbp is NOT considered as local object variable and its memory is collected as garbage. This patch is to use getcontext(...) to get more register value of current context and traverse those register to protect object from garbage collection.

This bug is not easy to expose, because usually rbp has been saved in stack when do garbage collection, so the object pointer can be scanned from stack. However when compiler do some optimization on register allocation or licm, the rbp live in gc_mark_and_sweep(...) and rbp is pointing to an object. In such situation, siod failed to run test.scm because object is collected as garbage unexpectedly.

Diff Detail

Repository
rL LLVM

Event Timeline

LuoYuanke created this revision.Mar 30 2019, 6:36 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 30 2019, 6:36 PM

Are sure RBP isn't in the jmp buffer? There's a macro used by setjmp called PTR_MANGLE that can xor the pointer with another variable which can obscure the value. It gets xored again in longjmp to demangle it. But in either case its not a good idea to examine the contents of jmp_buf so I think this is the right fix.

Are sure RBP isn't in the jmp buffer? There's a macro used by setjmp called PTR_MANGLE that can xor the pointer with another variable which can obscure the value. It gets xored again in longjmp to demangle it. But in either case its not a good idea to examine the contents of jmp_buf so I think this is the right fix.

Yes. I'm sure. There is two evidence that shows rbp is not saved in jmp buffer on my machine. First I set a breakpoint before calling setjmp and dump the register in gdb. After calling setjmp, I also dump the memory of jmp buffer and it shows the rbp value is not saved in jmp buffer. Second I disassemble the setjmp code as below and there is no code to save rbp in jmp buffer. Also when app fails, the failed object pointer is the same value of rbp which shows in gdb before setjmp.

Breakpoint 1, 0x00007ffff6f94040 in _setjmp () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-196.el7.x86_64
(gdb) disassemble
Dump of assembler code for function _setjmp:
=> 0x00007ffff6f94040 <+0>:     xor    %esi,%esi
   0x00007ffff6f94042 <+2>:     jmpq   0x7ffff6f93fa0 <__sigsetjmp>
End of assembler dump.
(gdb) disassemble __sigsetjmp
Dump of assembler code for function __sigsetjmp:
   0x00007ffff6f93fa0 <+0>:     mov    %rbx,(%rdi)
   0x00007ffff6f93fa3 <+3>:     mov    %rbp,%rax
   0x00007ffff6f93fa6 <+6>:     xor    %fs:0x30,%rax
   0x00007ffff6f93faf <+15>:    rol    $0x11,%rax
   0x00007ffff6f93fb3 <+19>:    mov    %rax,0x8(%rdi)
   0x00007ffff6f93fb7 <+23>:    mov    %r12,0x10(%rdi)
   0x00007ffff6f93fbb <+27>:    mov    %r13,0x18(%rdi)
   0x00007ffff6f93fbf <+31>:    mov    %r14,0x20(%rdi)
   0x00007ffff6f93fc3 <+35>:    mov    %r15,0x28(%rdi)
   0x00007ffff6f93fc7 <+39>:    lea    0x8(%rsp),%rdx
   0x00007ffff6f93fcc <+44>:    xor    %fs:0x30,%rdx
   0x00007ffff6f93fd5 <+53>:    rol    $0x11,%rdx
   0x00007ffff6f93fd9 <+57>:    mov    %rdx,0x30(%rdi)
   0x00007ffff6f93fdd <+61>:    mov    (%rsp),%rax
   0x00007ffff6f93fe1 <+65>:    nop
   0x00007ffff6f93fe2 <+66>:    xor    %fs:0x30,%rax
   0x00007ffff6f93feb <+75>:    rol    $0x11,%rax
   0x00007ffff6f93fef <+79>:    mov    %rax,0x38(%rdi)
   0x00007ffff6f93ff3 <+83>:    jmpq   0x7ffff6f94000 <__sigjmp_save>

(gdb) disassemble __sigjmp_save
Dump of assembler code for function __sigjmp_save:
   0x00007ffff6f94000 <+0>:     xor    %eax,%eax
   0x00007ffff6f94002 <+2>:     test   %esi,%esi
   0x00007ffff6f94004 <+4>:     push   %rbx
   0x00007ffff6f94005 <+5>:     mov    %rdi,%rbx
   0x00007ffff6f94008 <+8>:     je     0x7ffff6f9401f <__sigjmp_save+31>
   0x00007ffff6f9400a <+10>:    lea    0x48(%rdi),%rdx
   0x00007ffff6f9400e <+14>:    xor    %esi,%esi
   0x00007ffff6f94010 <+16>:    xor    %edi,%edi
   0x00007ffff6f94012 <+18>:    callq  0x7ffff6f94490 <sigprocmask>
   0x00007ffff6f94017 <+23>:    test   %eax,%eax
   0x00007ffff6f94019 <+25>:    sete   %al
   0x00007ffff6f9401c <+28>:    movzbl %al,%eax
   0x00007ffff6f9401f <+31>:    mov    %eax,0x40(%rbx)
   0x00007ffff6f94022 <+34>:    xor    %eax,%eax
   0x00007ffff6f94024 <+36>:    pop    %rbx
   0x00007ffff6f94025 <+37>:    retq
End of assembler dump.

I believe this is saving rbp but mangling it while doing so in a reversible way.

0x00007ffff6f93fa3 <+3>:     mov    %rbp,%rax
0x00007ffff6f93fa6 <+6>:     xor    %fs:0x30,%rax
0x00007ffff6f93faf <+15>:    rol    $0x11,%rax
0x00007ffff6f93fb3 <+19>:    mov    %rax,0x8(%rdi)

I believe this is saving rbp but mangling it while doing so in a reversible way.

0x00007ffff6f93fa3 <+3>:     mov    %rbp,%rax
0x00007ffff6f93fa6 <+6>:     xor    %fs:0x30,%rax
0x00007ffff6f93faf <+15>:    rol    $0x11,%rax
0x00007ffff6f93fb3 <+19>:    mov    %rax,0x8(%rdi)

Yes. It is demangled on __longjmp. But anyway in the jmp buffer the rbp value is mangled, so it is not object pointer any more.

(gdb) disassemble __longjmp
Dump of assembler code for function __longjmp:
   0x00007ffff6f94090 <+0>:     mov    0x30(%rdi),%r8
   0x00007ffff6f94094 <+4>:     mov    0x8(%rdi),%r9
   0x00007ffff6f94098 <+8>:     mov    0x38(%rdi),%rdx
   0x00007ffff6f9409c <+12>:    ror    $0x11,%r8
   0x00007ffff6f940a0 <+16>:    xor    %fs:0x30,%r8
   0x00007ffff6f940a9 <+25>:    ror    $0x11,%r9
   0x00007ffff6f940ad <+29>:    xor    %fs:0x30,%r9
   0x00007ffff6f940b6 <+38>:    ror    $0x11,%rdx
   0x00007ffff6f940ba <+42>:    xor    %fs:0x30,%rdx
   0x00007ffff6f940c3 <+51>:    nop
   0x00007ffff6f940c4 <+52>:    mov    (%rdi),%rbx
   0x00007ffff6f940c7 <+55>:    mov    0x10(%rdi),%r12
   0x00007ffff6f940cb <+59>:    mov    0x18(%rdi),%r13
   0x00007ffff6f940cf <+63>:    mov    0x20(%rdi),%r14
   0x00007ffff6f940d3 <+67>:    mov    0x28(%rdi),%r15
   0x00007ffff6f940d7 <+71>:    mov    %esi,%eax
   0x00007ffff6f940d9 <+73>:    mov    %r8,%rsp
   0x00007ffff6f940dc <+76>:    mov    %r9,%rbp
   0x00007ffff6f940df <+79>:    nop
   0x00007ffff6f940e0 <+80>:    jmpq   *%rdx
End of assembler dump.
rnk added a comment.Apr 2 2019, 11:29 AM

I think your analysis is correct and the fix seems reasonable.
I don't know what our policy is about making changes to test-suite or who would be a good owner to approve your changes, and as a result I'm hesitant to approve this.

FWIW, glibc/sysdeps/x86_64/__longjmp.S since Dec 22, 2005 mangles rbp, rsp and rip. This file was checked in into test-suite in 2003.
sysdeps/unix/sysv/linux/x86_64/getcontext.S doesn't mangle pointers, thus this change seems reasonable.

I would question whether it's actually worth it to attempt to keep software working that seems to have been abandoned for a decade as part of the llvm test suite? Should this code just be removed, instead of patched?

I'd note that not all platforms support getcontext -- at least Windows and OpenBSD don't. I don't actually know if this is expected to work on those platforms, but it definitely won't after this change.

MultiSource/Applications/siod/slib.c
132 ↗(On Diff #192996)

Renaming the global seems unfortunate; "save_regs_gc_mark" was a more descriptive name than "ucontext".

hfinkel added a subscriber: hfinkel.Apr 3 2019, 1:29 PM

I would question whether it's actually worth it to attempt to keep software working that seems to have been abandoned for a decade as part of the llvm test suite? Should this code just be removed, instead of patched?

Do we have anything else in the test suite that is substantially similar to this test? The problem, of course, is that each test-suite application is a representative of a class of applications with similar code regions, and that class of applications has known size. Given that getting code in the test suite which is covered by reasonable tests is non-trivial, it's generally worth fixing what we have to be correct.

I would question whether it's actually worth it to attempt to keep software working that seems to have been abandoned for a decade as part of the llvm test suite? Should this code just be removed, instead of patched?

I'd note that not all platforms support getcontext -- at least Windows and OpenBSD don't. I don't actually know if this is expected to work on those platforms, but it definitely won't after this change.

To be conservatively, we may just fix it for x86_64 on linux.

#if defined(__x86_64__)
 // On linux, rbp is mangled in jmp buffer, so object that pointed by rbp
 // is collected as garbage. We save rbp to stack_end so that the object
 // pointer can be found when scanning the stack.
 asm volatile ("movq %%rbp, %0":"=m"(stack_end)::);
#endif

Or we just fix it on linux

#if defined(linux)
use getcontext
#else
use setjmp
#endif

I prefer to fixing it on linux, because some register for other arch is also mangled in setjmp buffer.

LuoYuanke updated this revision to Diff 193647.Apr 3 2019, 8:11 PM

Use getcontext() only on linux platform.

MaskRay added inline comments.Apr 3 2019, 8:39 PM
MultiSource/Applications/siod/slib.c
1273 ↗(On Diff #193647)

#ifdef __GLIBC__

musl doesn't mangle RIP,RSP,RBP.

LuoYuanke added inline comments.Apr 3 2019, 9:37 PM
MultiSource/Applications/siod/slib.c
1273 ↗(On Diff #193647)

Thank you for the hint. But I don't find macroGLIBC with "cpp -dM". Is GLIBC an internal macro of compiler?
I assume on linux getcontext() is supported, so whenever getcontext() is supported we can use it.
So how about just fix the comments to tell developer on glibc (instead of on linux) register is mangled in jmp buffer?

MaskRay added inline comments.Apr 4 2019, 12:04 AM
MultiSource/Applications/siod/slib.c
1273 ↗(On Diff #193647)

glibc stdio.h includes features.h which defines __GLIBC__.

LuoYuanke marked an inline comment as done.Apr 7 2019, 6:42 PM
LuoYuanke added inline comments.
MultiSource/Applications/siod/slib.c
1273 ↗(On Diff #193647)

I think we prefer getcontext() to setjmp() whenever getcontext() is available, because getcontext() get more register values and it is more conservetive for garbage collection. Agree?

MaskRay added inline comments.Apr 7 2019, 7:22 PM
MultiSource/Applications/siod/slib.c
1273 ↗(On Diff #193647)

Using getcontext on Linux will make it uncompilable with musl.

musl doesn't implement makecontext/getcontext/swapcontext/setcontext. (The rationale was that the ucontext API saves/restores the signal mask as part of context switching, which inherently requires a syscall (SYS_rt_sigprocmask), making it unclear whether it outperforms threads. POSIX obsoleted and then removed it because the function prototype makes use of an obsolescent feature (function declarators with empty parentheses) of ISO C).

each test-suite application is a representative of a class of applications with similar code regions, and that class of applications has known size

Since you found the bug. Do you know if it is severe enough to make it less representative on Linux. If yes, it has probably been unnoticed for a decade. Maybe we should find something to replace siod.

LuoYuanke marked an inline comment as done.Apr 7 2019, 7:34 PM
LuoYuanke added inline comments.
MultiSource/Applications/siod/slib.c
1273 ↗(On Diff #193647)

Sorry. I don't know getcontext() is unsupported in other libc libraries on linux. I will apply your suggestion to use GLIBC.

If you change to condition on GLIBC instead of linux, I think this is fine to commit (although it still seems unfortunate to me that we're carrying an abandoned codebase which depends on weird broken stuff like this...)

LuoYuanke updated this revision to Diff 194128.Apr 8 2019, 6:13 AM

Fix the issue only when GLIBC is defined.

If you change to condition on GLIBC instead of linux, I think this is fine to commit (although it still seems unfortunate to me that we're carrying an abandoned codebase which depends on weird broken stuff like this...)

Change the condition on GLIBC instead of linux.

jyknight accepted this revision.Apr 8 2019, 7:56 AM
This revision is now accepted and ready to land.Apr 8 2019, 7:56 AM

@jyknight
I don't have permission to commit the patch. Could you help to commit it? Thanks.

This revision was automatically updated to reflect the committed changes.

@jyknight The test-suite is still in subversion and mirrored on github right, but not part of the regular llvm mono-repo? Is there a script to commit to it from the github mirror or you need to use svn or git-svn?