This is an archive of the discontinued LLVM Phabricator instance.

[lsan] Don't handle DTLS of thread under destruction
ClosedPublic

Authored by m.ostapenko on Mar 10 2017, 4:52 AM.

Details

Summary

When debugging a testcase from https://github.com/google/sanitizers/issues/757 I've noticed that LSan segfaults with NULL dereference after ~10 seconds of execution.
It turned out that suspended thread may have dtls->dtv_size == kDestroyedThread (-1) and LSan wrongly assumes that DTV is available.
This patch doesn't resolve original issue from GitHub but allows avoid rare segfaults in thread intrusive cases.

Diff Detail

Repository
rL LLVM

Event Timeline

m.ostapenko created this revision.Mar 10 2017, 4:52 AM
m.ostapenko retitled this revision from [lsan] Don't handle DTLS of thread under distruction to [lsan] Don't handle DTLS of thread under destruction.Mar 10 2017, 5:05 AM

Fix spelling.

vitalybuka added inline comments.Mar 10 2017, 11:18 AM
lib/sanitizer_common/sanitizer_tls_get_addr.cc
140 ↗(On Diff #91316)

Please remove dtls &&
!dtls is invalid and we should crash here

148 ↗(On Diff #91316)

UNREACHABLE() ?

kcc edited edge metadata.Mar 10 2017, 1:45 PM

is a test possible at all here?

Updating according to Vitaly's nits.

When testing new patch, I've encountered another error:

==7509==Processing thread 7507.
==7509==Could not get registers from thread 7507 (errno 3).
==7509==Unable to get registers from thread 7507.
==7509==Stack at 0x7f0c71bfa000-0x7f0c723ec980 (SP = 0x7f0c71bfa000).
Tracer caught signal 11: addr=0x7f0c71bfa000 pc=0x423980 sp=0x7f0c41b99da0
==7509==Could not detach from thread 7507 (errno 3).
==25753==LeakSanitizer has encountered a fatal error.
==25753==HINT: For debugging, try setting environment variable LSAN_OPTIONS=verbosity=1:log_threads=1
==25753==HINT: LeakSanitizer does not work under ptrace (strace, gdb, etc)

This happens because the stack memory of (destroyed) thread can be already unmapped at point of analysis thus accessing it later would lead to segfault.
I've introduced a new flag (off by default) to mitigate the issue.

Regarding to testcase, I'm afraid we'll need something like testcase from GitHub issue (infinite program with lots of threads creation/destruction) to trigger the issue, that might be too intrusive/flaky for buildbots. If it's acceptable to add such a test, I can add it for sure.

This revision is now accepted and ready to land.Mar 20 2017, 2:08 PM
kcc added a comment.Mar 20 2017, 2:23 PM

Not LG. We are adding a flag and non-trivial logic w/o a test.

After more debugging it seems that the issue is even more complicated.

Tracer caught signal 11: addr=0x7fb108da4000 pc=0x423990 sp=0x7fb135ffed90
==9109==Process memory map follows:
	0x000000400000-0x000000441000	/tmp/a.out 0x000000000005
	0x000000641000-0x000000642000	/tmp/a.out 0x000000000001
	0x000000642000-0x000000645000	/tmp/a.out 0x000000000003
...
        0x7fb1085a4000-**0x7fb108da4000**	 rw (0x000000000003)

The faulty address looks fine (addressable and accessible), but the problem seems to be pretty similar to that we saw in fast unwinder some time ago (fixed by Evgeniy with https://reviews.llvm.org/rL219683).
Possible explanation (that we considered when debugging segfault in unwinder) looks like this:

  1. Kernel maps stacks from higher addresses to lower (MAP_GROWSDOWN flag in mmap syscall).
  2. Kernel maps stacks non-atomically (i.e not all ulilmit amount of stack memory become addressable simultaneously, lower pages become available a little bit later than higher).
  3. If we make access to *stack_top (== stack_bot - ulimit) before kernel actually mapped all ulimit range, we'll have segfault.

It's hard to trigger the issue within just one process, I can reproduce the segfault only when running three test cases from https://github.com/google/sanitizers/issues/757 in parallel (thus slowing down the kernel, perhaps).
I'm trying to cook a testcase now, but I'm not sure I can reproduce the issue without running several test instances simultaneously (e.g. ./test & ./test & ./test). Is it acceptable to have such a test in compiler-rt testsuite?

kcc added a comment.Mar 21 2017, 4:49 PM

So, this means you suspect a kernel bug?
Is it known? Fixed in upstream? Reproduced on latest kernels?

lib/lsan/lsan_flags.inc
48 ↗(On Diff #91522)

when is this flag false?

eugenis edited edge metadata.Mar 22 2017, 1:33 PM

After more debugging it seems that the issue is even more complicated.

Tracer caught signal 11: addr=0x7fb108da4000 pc=0x423990 sp=0x7fb135ffed90
...
        0x7fb1085a4000-**0x7fb108da4000**	 rw (0x000000000003)

This address does not look accessible to me. Well, that depends on the next mapping.

After more debugging it seems that the issue is even more complicated.

Tracer caught signal 11: addr=0x7fb108da4000 pc=0x423990 sp=0x7fb135ffed90
...
        0x7fb1085a4000-**0x7fb108da4000**	 rw (0x000000000003)

This address does not look accessible to me. Well, that depends on the next mapping.

Oh, right, this is the highest address, not the lowest... Sorry. AFAIR on x86_64 static TLS (as well as TCB) for non-main threads should be located right above stack area (need to recheck).

Turned out I was wrong again. Looking to verbose log more carefully, one can see that segfault happens when GetRegistersAndSP fails with errno 3 (ESRCH):

==13657==Attached to thread 14860.
==13657==Attached to thread 14868.
==13657==Attached to thread 13349.
==13657==Attached to thread 13357.
==13657==Could not get registers from thread 13349 (errno 3).
==13657==Unable to get registers from thread 13349.
Tracer caught signal 11: addr=0x7ff4b259b000 pc=0x4239d0 sp=0x7ff4b1d99d90
==13657==Process memory map follows:
...
        0x7ff4b1d9b000-0x7ff4b259b000    0x000000000003 (rw)
        0x7ff4b259b000-0x7ff4b259c000    0x000000000000
...
==13657==End of process memory map.
==13657==Detached from thread 14860.
==13657==Detached from thread 14868.
==13657==Could not detach from thread 13349 (errno 3).
==13657==Detached from thread 13357.
==14860==LeakSanitizer has encountered a fatal error.
==14860==HINT: For debugging, try setting environment variable LSAN_OPTIONS=verbosity=1:log_threads=1
==14860==HINT: LeakSanitizer does not work under ptrace (strace, gdb, etc)

Although LSan successfully attached to thread 13349, it seems that this thread was killed by concurrent SIGKILL signal. Thus stack boundaries extracted by GetRegistersAndSP are already invalid and we access "bad" memory (guard page in this case).
According to ptrace manual, when user tries to get information from ptrace stopped thread he should always be ready to handle ESRCH error:

The tracer cannot assume that the ptrace-stopped tracee exists.
There are many scenarios when the tracee may die while stopped (such
as SIGKILL).  Therefore, the tracer must be prepared to handle an
ESRCH error on any ptrace operation.  Unfortunately, the same error
is returned if the tracee exists but is not ptrace-stopped (for
commands which require a stopped tracee), or if it is not traced by
the process which issued the ptrace call.  The tracer needs to keep
track of the stopped/running state of the tracee, and interpret ESRCH
as "tracee died unexpectedly" only if it knows that the tracee has
been observed to enter ptrace-stop.  Note that there is no guarantee
that waitpid(WNOHANG) will reliably report the tracee's death status
if a ptrace operation returned ESRCH.  waitpid(WNOHANG) may return 0
instead.  In other words, the tracee may be "not yet fully dead", but
already refusing ptrace requests.

I'm adjusting the patch to handle ESRCH properly in LSan.

eugenis accepted this revision.Apr 4 2017, 4:43 PM

LGTM

lib/lsan/lsan_common.cc
208 ↗(On Diff #93888)

Any idea why is it OK to continue if the registers can not be read, and when can that happen? There is nothing in git history...

m.ostapenko added inline comments.Apr 5 2017, 8:33 AM
lib/lsan/lsan_common.cc
208 ↗(On Diff #93888)

As far as I can see in kernel code (http://lxr.free-electrons.com/source/include/linux/regset.h#L58) it seems that possible errno values despite ESRCH are EIO, EDEV and EFAULT. For x86_64 the only possible errno value is EFAULT (http://lxr.free-electrons.com/source/arch/x86/kernel/ptrace.c#L456). Frankly, I can't tell which of these errors can pop up given the fact that we've already successfully attached to the thread.

This revision was automatically updated to reflect the committed changes.