Details

Reviewers

kcc
vitalybuka
eugenis

Commits

rGfe863a65105c: [lsan] Avoid segfaults during threads destruction under high load
rCRT299630: [lsan] Avoid segfaults during threads destruction under high load
rL299630: [lsan] Avoid segfaults during threads destruction under high load

Summary

While debugging a testcase from https://github.com/google/sanitizers/issues/757, I noticed that LSan segfaults with a NULL dereference after ~10 seconds of execution.
It turned out that a suspended thread may have dtls->dtv_size == kDestroyedThread (-1), and LSan wrongly assumes that the DTV is available.
This patch doesn't resolve the original issue from GitHub, but it avoids rare segfaults in thread-intensive cases.
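
The check described in the summary can be sketched as follows. This is a minimal standalone illustration, not the sanitizer sources: the `DTLS` struct is a simplified stand-in, `CollectDtlsRanges` is a hypothetical helper, and `kDestroyedThread` is assumed to be the all-ones sentinel implied by the "(-1)" above.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using uptr = std::uint64_t;

// Assumed sentinel: dtv_size is set to -1 once the thread's DTLS is torn down.
constexpr uptr kDestroyedThread = static_cast<uptr>(-1);

// Simplified stand-in for the sanitizer's DTLS bookkeeping.
struct DTLS {
  struct DTV {
    uptr beg, size;
  };
  uptr dtv_size;
  DTV *dtv;
};

bool DTLSInDestruction(const DTLS *dtls) {
  assert(dtls != nullptr);  // a null DTLS is a caller bug, not a state to handle
  return dtls->dtv_size == kDestroyedThread;
}

// Hypothetical helper: returns the [beg, end) blocks a scanner would visit.
// Without the DTLSInDestruction guard, the loop below would treat the sentinel
// as a huge element count and dereference a dtv that may already be gone.
std::vector<std::pair<uptr, uptr>> CollectDtlsRanges(const DTLS *dtls) {
  std::vector<std::pair<uptr, uptr>> ranges;
  if (DTLSInDestruction(dtls)) return ranges;
  for (uptr j = 0; j < dtls->dtv_size; ++j) {
    uptr beg = dtls->dtv[j].beg;
    uptr end = beg + dtls->dtv[j].size;
    if (beg < end) ranges.emplace_back(beg, end);
  }
  return ranges;
}
```

A thread whose DTLS is marked destroyed yields no ranges, while a live thread yields one range per non-empty block.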

Diff Detail

Repository: rL LLVM

Event Timeline

m.ostapenko created this revision. Mar 10 2017, 4:52 AM

Herald added a subscriber: kubamracek. Mar 10 2017, 4:52 AM

m.ostapenko retitled this revision from [lsan] Don't handle DTLS of thread under distruction to [lsan] Don't handle DTLS of thread under destruction. Mar 10 2017, 5:05 AM

Fix spelling.

vitalybuka added inline comments. Mar 10 2017, 11:18 AM

lib/sanitizer_common/sanitizer_tls_get_addr.cc
140 ↗	(On Diff #91316)	Please remove "dtls &&". !dtls is invalid and we should crash here.
148 ↗	(On Diff #91316)	UNREACHABLE() ?

is a test possible at all here?

Updating according to Vitaly's nits.

When testing the new patch, I encountered another error:

==7509==Processing thread 7507.
==7509==Could not get registers from thread 7507 (errno 3).
==7509==Unable to get registers from thread 7507.
==7509==Stack at 0x7f0c71bfa000-0x7f0c723ec980 (SP = 0x7f0c71bfa000).
Tracer caught signal 11: addr=0x7f0c71bfa000 pc=0x423980 sp=0x7f0c41b99da0
==7509==Could not detach from thread 7507 (errno 3).
==25753==LeakSanitizer has encountered a fatal error.
==25753==HINT: For debugging, try setting environment variable LSAN_OPTIONS=verbosity=1:log_threads=1
==25753==HINT: LeakSanitizer does not work under ptrace (strace, gdb, etc)

This happens because the stack memory of a (destroyed) thread may already be unmapped at the point of analysis, so accessing it later leads to a segfault.
I've introduced a new flag (off by default) to mitigate the issue.

Regarding a testcase, I'm afraid we'd need something like the testcase from the GitHub issue (a never-ending program that creates and destroys lots of threads) to trigger the issue, which might be too intrusive/flaky for buildbots. If it's acceptable to add such a test, I can certainly add it.

Ping.

LGTM

This revision is now accepted and ready to land. Mar 20 2017, 2:08 PM

Not LG. We are adding a flag and non-trivial logic w/o a test.

After more debugging it seems that the issue is even more complicated.

Tracer caught signal 11: addr=0x7fb108da4000 pc=0x423990 sp=0x7fb135ffed90
==9109==Process memory map follows:
	0x000000400000-0x000000441000	/tmp/a.out 0x000000000005
	0x000000641000-0x000000642000	/tmp/a.out 0x000000000001
	0x000000642000-0x000000645000	/tmp/a.out 0x000000000003
...
        0x7fb1085a4000-**0x7fb108da4000**	 rw (0x000000000003)

The faulting address looks fine (addressable and accessible), but the problem seems quite similar to the one we saw in the fast unwinder some time ago (fixed by Evgeniy in https://reviews.llvm.org/rL219683).
A possible explanation (which we considered when debugging the segfault in the unwinder) looks like this:

The kernel maps stacks from higher addresses to lower ones (MAP_GROWSDOWN flag in the mmap syscall).
The kernel maps stacks non-atomically (i.e. the whole ulimit amount of stack memory does not become addressable at once; lower pages become available a little later than higher ones).
If we access *stack_top (== stack_bot - ulimit) before the kernel has actually mapped the whole ulimit range, we get a segfault.

It's hard to trigger the issue within just one process; I can reproduce the segfault only when running three test cases from https://github.com/google/sanitizers/issues/757 in parallel (thus slowing down the kernel, perhaps).
I'm trying to cook up a testcase now, but I'm not sure I can reproduce the issue without running several test instances simultaneously (e.g. ./test & ./test & ./test). Is it acceptable to have such a test in the compiler-rt testsuite?

So, this means you suspect a kernel bug?
Is it known? Fixed in upstream? Reproduced on latest kernels?

lib/lsan/lsan_flags.inc
48 ↗	(On Diff #91522)	when is this flag false?

In D30818#706590, @m.ostapenko wrote:
After more debugging it seems that the issue is even more complicated.
Tracer caught signal 11: addr=0x7fb108da4000 pc=0x423990 sp=0x7fb135ffed90
...
        0x7fb1085a4000-**0x7fb108da4000**	 rw (0x000000000003)

This address does not look accessible to me. Well, that depends on the next mapping.

In D30818#707959, @eugenis wrote:
In D30818#706590, @m.ostapenko wrote:
After more debugging it seems that the issue is even more complicated.
Tracer caught signal 11: addr=0x7fb108da4000 pc=0x423990 sp=0x7fb135ffed90
...
        0x7fb1085a4000-**0x7fb108da4000**	 rw (0x000000000003)
This address does not look accessible to me. Well, that depends on the next mapping.

Oh, right, this is the highest address, not the lowest... Sorry. AFAIR, on x86_64 the static TLS (as well as the TCB) for non-main threads should be located right above the stack area (need to recheck).

It turned out I was wrong again. Looking at the verbose log more carefully, one can see that the segfault happens when GetRegistersAndSP fails with errno 3 (ESRCH):
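
The point about the end address can be checked with a small sketch. It assumes the usual convention that a memory-map line describes a half-open range [begin, end); `Mapping` and `Contains` are hypothetical names used only for illustration, and the addresses are the ones from the log quoted above.

```cpp
#include <cassert>
#include <cstdint>

using uptr = std::uint64_t;

// One line of a process memory map, read as the half-open range [begin, end).
struct Mapping {
  uptr begin;
  uptr end;
};

// True if addr lies inside the mapping. `end` itself is the first byte past
// the mapping, so whether it is accessible depends on the next mapping.
bool Contains(const Mapping &m, uptr addr) {
  assert(m.begin <= m.end);
  return m.begin <= addr && addr < m.end;
}
```

With the range 0x7fb1085a4000-0x7fb108da4000, the faulting address 0x7fb108da4000 is not contained, while the byte just below it is.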

==13657==Attached to thread 14860.
==13657==Attached to thread 14868.
==13657==Attached to thread 13349.
==13657==Attached to thread 13357.
==13657==Could not get registers from thread 13349 (errno 3).
==13657==Unable to get registers from thread 13349.
Tracer caught signal 11: addr=0x7ff4b259b000 pc=0x4239d0 sp=0x7ff4b1d99d90
==13657==Process memory map follows:
...
        0x7ff4b1d9b000-0x7ff4b259b000    0x000000000003 (rw)
        0x7ff4b259b000-0x7ff4b259c000    0x000000000000
...
==13657==End of process memory map.
==13657==Detached from thread 14860.
==13657==Detached from thread 14868.
==13657==Could not detach from thread 13349 (errno 3).
==13657==Detached from thread 13357.
==14860==LeakSanitizer has encountered a fatal error.
==14860==HINT: For debugging, try setting environment variable LSAN_OPTIONS=verbosity=1:log_threads=1
==14860==HINT: LeakSanitizer does not work under ptrace (strace, gdb, etc)

Although LSan successfully attached to thread 13349, it seems that this thread was killed by a concurrent SIGKILL signal. Thus the stack boundaries extracted by GetRegistersAndSP are already invalid and we access "bad" memory (a guard page in this case).
According to the ptrace manual, a tracer that tries to get information from a ptrace-stopped thread should always be ready to handle an ESRCH error:

The tracer cannot assume that the ptrace-stopped tracee exists.
There are many scenarios when the tracee may die while stopped (such
as SIGKILL).  Therefore, the tracer must be prepared to handle an
ESRCH error on any ptrace operation.  Unfortunately, the same error
is returned if the tracee exists but is not ptrace-stopped (for
commands which require a stopped tracee), or if it is not traced by
the process which issued the ptrace call.  The tracer needs to keep
track of the stopped/running state of the tracee, and interpret ESRCH
as "tracee died unexpectedly" only if it knows that the tracee has
been observed to enter ptrace-stop.  Note that there is no guarantee
that waitpid(WNOHANG) will reliably report the tracee's death status
if a ptrace operation returned ESRCH.  waitpid(WNOHANG) may return 0
instead.  In other words, the tracee may be "not yet fully dead", but
already refusing ptrace requests.

I'm adjusting the patch to handle ESRCH properly in LSan.
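
A minimal sketch of what "handling ESRCH properly" could look like on the side that issues the ptrace request. `RegsStatus` and `ClassifyPtraceResult` are hypothetical names for this illustration; the inputs stand for the outcome of a register-reading ptrace request and the errno it set.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical status type for this sketch.
enum class RegsStatus { kAvailable, kUnavailable, kTraceeGone };

// Maps the outcome of a register-reading ptrace request to a status.
// `failed` says whether the request reported an error; `err` is its errno.
RegsStatus ClassifyPtraceResult(bool failed, int err) {
  if (!failed) return RegsStatus::kAvailable;
  assert(err != 0 && "a failed request must carry an errno");
  // ESRCH: the tracee died or is not ptrace-stopped, so its stack and other
  // memory must not be inspected. Other errors only mean "no registers".
  return err == ESRCH ? RegsStatus::kTraceeGone : RegsStatus::kUnavailable;
}
```

Keeping ESRCH separate from other failures lets the caller skip the thread entirely instead of falling back to scanning its whole stack.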

LGTM

lib/lsan/lsan_common.cc
208 ↗	(On Diff #93888)	Any idea why it is OK to continue if the registers cannot be read, and when that can happen? There is nothing in git history...

m.ostapenko added inline comments. Apr 5 2017, 8:33 AM

lib/lsan/lsan_common.cc
208 ↗	(On Diff #93888)	As far as I can see in the kernel code (http://lxr.free-electrons.com/source/include/linux/regset.h#L58), the possible errno values besides ESRCH are EIO, ENODEV and EFAULT. For x86_64 the only possible errno value is EFAULT (http://lxr.free-electrons.com/source/arch/x86/kernel/ptrace.c#L456). Frankly, I can't tell which of these errors can pop up, given that we've already successfully attached to the thread.

Closed by commit rL299630: [lsan] Avoid segfaults during threads destruction under high load (authored by chefmax). Apr 6 2017, 12:55 AM

This revision was automatically updated to reflect the committed changes.

Diff 94324

compiler-rt/trunk/lib/lsan/lsan_common.cc

 ...
     bool thread_found = GetThreadRangesLocked(os_id, &stack_begin, &stack_end,
 ...
                                               &cache_begin, &cache_end, &dtls);
     if (!thread_found) {
       // If a thread can't be found in the thread registry, it's probably in the
       // process of destruction. Log this event and move on.
       LOG_THREADS("Thread %d not found in registry.\n", os_id);
       continue;
     }
     uptr sp;
-    bool have_registers =
-        (suspended_threads.GetRegistersAndSP(i, registers.data(), &sp) == 0);
-    if (!have_registers) {
-      Report("Unable to get registers from thread %d.\n");
-      // If unable to get SP, consider the entire stack to be reachable.
+    PtraceRegistersStatus have_registers =
+        suspended_threads.GetRegistersAndSP(i, registers.data(), &sp);
+    if (have_registers != REGISTERS_AVAILABLE) {
+      Report("Unable to get registers from thread %d.\n", os_id);
+      // If unable to get SP, consider the entire stack to be reachable unless
+      // GetRegistersAndSP failed with ESRCH.
+      if (have_registers == REGISTERS_UNAVAILABLE_FATAL) continue;
       sp = stack_begin;
     }

     if (flags()->use_registers && have_registers)
       ScanRangeForPointers(registers_begin, registers_end, frontier,
                            "REGISTERS", kReachable);

     if (flags()->use_stacks) {
 ...
     if (flags()->use_tls) {
 ...
         CHECK_LE(tls_begin, cache_begin);
         CHECK_GE(tls_end, cache_end);
         if (tls_begin < cache_begin)
           ScanRangeForPointers(tls_begin, cache_begin, frontier, "TLS",
                                kReachable);
         if (tls_end > cache_end)
           ScanRangeForPointers(cache_end, tls_end, frontier, "TLS", kReachable);
       }
-      if (dtls) {
+      if (dtls && !DTLSInDestruction(dtls)) {
         for (uptr j = 0; j < dtls->dtv_size; ++j) {
           uptr dtls_beg = dtls->dtv[j].beg;
           uptr dtls_end = dtls_beg + dtls->dtv[j].size;
           if (dtls_beg < dtls_end) {
             LOG_THREADS("DTLS %zu at %p-%p.\n", j, dtls_beg, dtls_end);
             ScanRangeForPointers(dtls_beg, dtls_end, frontier, "DTLS",
                                  kReachable);
           }
         }
+      } else {
+        // We are handling a thread with DTLS under destruction. Log about
+        // this and continue.
+        LOG_THREADS("Thread %d has DTLS under destruction.\n", os_id);
       }
     }
   }
 }

 static void ProcessRootRegion(Frontier *frontier, uptr root_begin,
                               uptr root_end) {
   MemoryMappingLayout proc_maps(/*cache_enabled*/true);
 ...
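
The three-way decision that this file's change makes for each suspended thread can be sketched in isolation. The enum values are copied from the patch; `Range` and `StackRangeToScan` are hypothetical names, and the sketch assumes that a successfully read SP lies within the thread's stack bounds.

```cpp
#include <cassert>
#include <cstdint>

using uptr = std::uint64_t;

// Values as introduced by the patch in sanitizer_stoptheworld.h.
enum PtraceRegistersStatus {
  REGISTERS_UNAVAILABLE_FATAL = -1,
  REGISTERS_UNAVAILABLE = 0,
  REGISTERS_AVAILABLE = 1
};

struct Range {
  uptr begin, end;
};

// Hypothetical helper: decides which part of the stack to scan.
// Returns false if the thread must be skipped entirely.
bool StackRangeToScan(PtraceRegistersStatus status, uptr sp, uptr stack_begin,
                      uptr stack_end, Range *out) {
  assert(stack_begin <= stack_end);
  // ESRCH case: the thread may be dead, so its stack may be unmapped.
  if (status == REGISTERS_UNAVAILABLE_FATAL) return false;
  // No SP but the thread is still there: treat the whole stack as reachable.
  if (status == REGISTERS_UNAVAILABLE) sp = stack_begin;
  out->begin = sp;  // assumed to lie within [stack_begin, stack_end]
  out->end = stack_end;
  return true;
}
```

Because REGISTERS_UNAVAILABLE is 0 and the fatal case is filtered out first, a later truthiness test on the status (as in the use_registers check of the diff) is true only when registers were actually read.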

compiler-rt/trunk/lib/sanitizer_common/sanitizer_stoptheworld.h

 ...
 #define SANITIZER_STOPTHEWORLD_H

 #include "sanitizer_internal_defs.h"
 #include "sanitizer_common.h"

 namespace __sanitizer {
 typedef int SuspendedThreadID;

+enum PtraceRegistersStatus {
+  REGISTERS_UNAVAILABLE_FATAL = -1,
+  REGISTERS_UNAVAILABLE = 0,
+  REGISTERS_AVAILABLE = 1
+};
+
 // Holds the list of suspended threads and provides an interface to dump their
 // register contexts.
 class SuspendedThreadsList {
  public:
   SuspendedThreadsList()
     : thread_ids_(1024) {}
   SuspendedThreadID GetThreadID(uptr index) const {
     CHECK_LT(index, thread_ids_.size());
     return thread_ids_[index];
   }
-  int GetRegistersAndSP(uptr index, uptr *buffer, uptr *sp) const;
+  PtraceRegistersStatus GetRegistersAndSP(uptr index, uptr *buffer,
+                                          uptr *sp) const;
   // The buffer in GetRegistersAndSP should be at least this big.
   static uptr RegisterCount();
   uptr thread_count() const { return thread_ids_.size(); }
   bool Contains(SuspendedThreadID thread_id) const {
     for (uptr i = 0; i < thread_ids_.size(); i++) {
       if (thread_ids_[i] == thread_id)
         return true;
     }
 ...

compiler-rt/trunk/lib/sanitizer_common/sanitizer_stoptheworld_linux_libcdep.cc

 ...
 typedef _user_regs_struct regs_struct;
 #define REG_SP gprs[15]
 #define ARCH_IOVEC_FOR_GETREGSET

 #else
 #error "Unsupported architecture"
 #endif // SANITIZER_ANDROID && defined(__arm__)

-int SuspendedThreadsList::GetRegistersAndSP(uptr index,
+PtraceRegistersStatus SuspendedThreadsList::GetRegistersAndSP(uptr index,
     uptr *buffer,
     uptr *sp) const {
   pid_t tid = GetThreadID(index);
   regs_struct regs;
   int pterrno;
 #ifdef ARCH_IOVEC_FOR_GETREGSET
   struct iovec regset_io;
   regset_io.iov_base = &regs;
   regset_io.iov_len = sizeof(regs_struct);
   bool isErr = internal_iserror(internal_ptrace(PTRACE_GETREGSET, tid,
                                 (void*)NT_PRSTATUS, (void*)&regset_io),
                                 &pterrno);
 #else
   bool isErr = internal_iserror(internal_ptrace(PTRACE_GETREGS, tid, nullptr,
                                 &regs), &pterrno);
 #endif
   if (isErr) {
     VReport(1, "Could not get registers from thread %d (errno %d).\n", tid,
             pterrno);
-    return -1;
+    // ESRCH means that the given thread is not suspended or already dead.
+    // Therefore it's unsafe to inspect its data (e.g. walk through stack) and
+    // we should notify caller about this.
+    return pterrno == ESRCH ? REGISTERS_UNAVAILABLE_FATAL
+                            : REGISTERS_UNAVAILABLE;
   }

   *sp = regs.REG_SP;
   internal_memcpy(buffer, &regs, sizeof(regs));
-  return 0;
+  return REGISTERS_AVAILABLE;
 }

 uptr SuspendedThreadsList::RegisterCount() {
   return sizeof(regs_struct) / sizeof(uptr);
 }
 } // namespace __sanitizer

 #endif // SANITIZER_LINUX && (defined(__x86_64__) || defined(__mips__)
        // || defined(__aarch64__) || defined(__powerpc64__)
        // || defined(__s390__) || defined(__i386__)

compiler-rt/trunk/lib/sanitizer_common/sanitizer_tls_get_addr.h

 ...

 // Returns pointer and size of a linker-allocated TLS block.
 // Each block is returned exactly once.
 DTLS::DTV *DTLS_on_tls_get_addr(void *arg, void *res, uptr static_tls_begin,
                                 uptr static_tls_end);
 void DTLS_on_libc_memalign(void *ptr, uptr size);
 DTLS *DTLS_Get();
 void DTLS_Destroy(); // Make sure to call this before the thread is destroyed.
+// Returns true if DTLS of suspended thread is in destruction process.
+bool DTLSInDestruction(DTLS *dtls);

 } // namespace __sanitizer

 #endif // SANITIZER_TLS_GET_ADDR_H

compiler-rt/trunk/lib/sanitizer_common/sanitizer_tls_get_addr.cc

 ...
 void DTLS_on_libc_memalign(void *ptr, uptr size) {
   if (!common_flags()->intercept_tls_get_addr) return;
   VPrintf(2, "DTLS_on_libc_memalign: %p %p\n", ptr, size);
   dtls.last_memalign_ptr = reinterpret_cast<uptr>(ptr);
   dtls.last_memalign_size = size;
 }

 DTLS *DTLS_Get() { return &dtls; }

+bool DTLSInDestruction(DTLS *dtls) {
+  return dtls->dtv_size == kDestroyedThread;
+}
+
 #else
 void DTLS_on_libc_memalign(void *ptr, uptr size) {}
 DTLS::DTV *DTLS_on_tls_get_addr(void *arg, void *res) { return 0; }
 DTLS *DTLS_Get() { return 0; }
 void DTLS_Destroy() {}
+bool DTLSInDestruction(DTLS *dtls) { UNREACHABLE(); }

 #endif // SANITIZER_INTERCEPT_TLS_GET_ADDR

 } // namespace __sanitizer

This is an archive of the discontinued LLVM Phabricator instance.

[lsan] Don't handle DTLS of thread under destruction
ClosedPublic
