This is an archive of the discontinued LLVM Phabricator instance.

tsan: fix signal handling during stop-the-world
AbandonedPublic

Authored by dvyukov on Mar 3 2015, 5:16 AM.

Details

Reviewers
earthdok
Summary

Long story short: stop-the-world briefly resets SIGSEGV handler to SIG_DFL.
This breaks programs that handle and continue after SIGSEGV (namely JVM).
See the test and comments for details.

This is reincarnation of reverted r229678 (http://reviews.llvm.org/D7722).
Changed:

  • execute TracerThreadDieCallback only on tracer thread
  • reset global data in TracerThreadSignalHandler/TracerThreadDieCallback
  • handle EINTR from waitpid

Add 3 new test:

  • SIGSEGV during leak checking
  • StopTheWorld operation during signal storm from an external process
  • StopTheWorld operation when the program generates and handles SIGSEGVs

Diff Detail

Event Timeline

dvyukov updated this revision to Diff 21097.Mar 3 2015, 5:16 AM
dvyukov retitled this revision from to tsan: fix signal handling during stop-the-world.
dvyukov updated this object.
dvyukov edited the test plan for this revision. (Show Details)
dvyukov added a reviewer: earthdok.
dvyukov added a subscriber: Unknown Object (MLST).
earthdok added inline comments.Mar 3 2015, 7:09 AM
lib/sanitizer_common/sanitizer_stoptheworld_linux_libcdep.cc
416

Wrong. You're assuming that internal_waitpid() will not touch errno. This is true only on x86_64 Linux where we have our custom wrappers implemented. Other platforms (MIPS, ARM) use the libc implementation (see sanitizer_syscall_generic.inc).

test/asan/TestCases/Linux/leak_check_segv.cc
2

-pthread is unnecessary with asan
if you want to use it at least be consistent (-pthread/-lpthread)

test/asan/TestCases/Linux/signal_during_stop_the_world.cc
5

see above

37

Why these two signals in particular?

43

This barely makes a difference, the time spent attaching to threads will probably be >> this.

If you want to slow down leak detection you could make a few large allocs and fill the memory with pointers to them.

54

There's zero guarantee that it will crash...

dvyukov updated this revision to Diff 21106.Mar 3 2015, 8:02 AM

PTAL

lib/sanitizer_common/sanitizer_stoptheworld_linux_libcdep.cc
416

I am assuming nothing. This is existing code. I just properly handle EINTR.
Is there a simple way to fix other platforms?

test/asan/TestCases/Linux/leak_check_segv.cc
2

Done

test/asan/TestCases/Linux/signal_during_stop_the_world.cc
5

Done

37

no strong reason
it is reasonable if SIGCHLD unblocks waitpid
but on my system it seems that SIGCHLD does not, but SIGPROF does
added comment

43

Removed this part

54

Episodic crashes are fine with me.
Still much better than the current absence of interesting tests.

earthdok added inline comments.Mar 3 2015, 8:30 AM
lib/sanitizer_common/sanitizer_stoptheworld_linux_libcdep.cc
416

As discussed offline: existing code only works correctly (on other platforms) under the assumption that signals are blocked => waitpid does not return until tracer is dead.

test/asan/TestCases/Linux/signal_during_stop_the_world.cc
54

As discussed offline, it should actually crash as soon as the ScopedStackSpaceWithGuard object is destroyed. I retract my statement. May be worth a comment.

hans added a subscriber: hans.Mar 3 2015, 10:51 AM

Fixed the errno issue by spin-waiting on completion of the tracer thread.

earthdok edited edge metadata.Mar 4 2015, 11:16 AM

We need a comment in sanitizer_stoptheworld.h documenting the fact that the callback must exit by returning and not by calling exit() or _exit().

For the record, this is what currently happens when the tracer dies for any reason:

If  the  tracer  dies,  all  tracees  are  automatically  detached  and
restarted,  unless  they  were in group-stop.  Handling of restart from
group-stop is currently buggy, but the  "as  planned"  behavior  is  to
leave  tracee  stopped  and  waiting  for  SIGCONT.   If  the tracee is
restarted from signal-delivery-stop, the pending signal is injected.
lib/sanitizer_common/sanitizer_stoptheworld_linux_libcdep.cc
208

what is the purpose of these changes?

229–231

same question

414

internal_waitpid() may call syscall() which can access/spoil...

dvyukov updated this revision to Diff 21273.Mar 5 2015, 6:36 AM
dvyukov edited edge metadata.

We need a comment in sanitizer_stoptheworld.h documenting the fact that the callback must exit by returning and not by calling exit() or _exit().

done

lib/sanitizer_common/sanitizer_stoptheworld_linux_libcdep.cc
208

I want to set inst->arg->done=1 as the last action before internal__exit. But I also need to set thread_suspender_instance to NULL. This means that I need a copy of thread_suspender_instance in a local var.

229–231

Just to reduce length of overly lengthy identifiers.

414

done

Still kinda wary about this approach, but let's give it a try. LGTM

lib/sanitizer_common/sanitizer_stoptheworld.h
63 ↗(On Diff #21273)

nit: could you please be consistent in "function" vs "function()" in comments

lib/sanitizer_common/sanitizer_stoptheworld_linux_libcdep.cc
414

same

fixed in rev 231457

dvyukov abandoned this revision.Apr 15 2021, 1:20 AM