This is an archive of the discontinued LLVM Phabricator instance.

[Tsan] Fix the signal_recursive.cc test to pass on systems with high loads
ClosedPublic

Authored by kutuzov.viktor.84 on Dec 3 2014, 8:22 AM.

Diff Detail

Repository
rL LLVM

Event Timeline

kutuzov.viktor.84 retitled this revision from to [Tsan] Fix the signal_recursive.cc test to pass on systems with high loads.
kutuzov.viktor.84 updated this object.
kutuzov.viktor.84 edited the test plan for this revision. (Show Details)
kutuzov.viktor.84 added a subscriber: Unknown Object (MLST).
dvyukov edited edge metadata.Dec 3 2014, 8:46 AM

Why does tsan deadlock? Does the program actually deadlock in tsan runtime?

Why does tsan deadlock? Does the program actually deadlock in tsan runtime?

Without this sleep we enter CollectGarbage() before BusyThread() is called (the numbers on the left side are what clock_gettime(CLOCK_REALTIME_PRECISE) returns):

1417692217.980580462: Init()
1417692217.987495307: pthread_create()
1417692218.54266293: CollectGarbage()
1417692218.54445716: in CollectGarbage()
1417692218.54574294: in StopWorld()
LLVMSymbolizer: error reading file: No such file or directory.
==================
WARNING: ThreadSanitizer: data race (pid=23947)
  Write of size 8 at 0x7db000000000 by thread T1:
    #0 memcpy /home/buildslave/ikosarev/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:640
...
SUMMARY: ThreadSanitizer: data race ??:0 ??
==================
Stack dump:
0.      Program arguments: /home/buildslave/ikosarev/build/./bin/llvm-symbolizer --inlining=true --default-arch=x86_64

With sleep(1) it is:

1417692556.835742780: Init()
1417692556.848583448: pthread_create()
1417692556.896431988: in BusyThread()
1417692556.896830432: in BusyThread() loop
1417692556.897031854: in BusyThread() loop
 ...
1417692557.894940248: in BusyThread() loop
LLVMSymbolizer: error reading file: No such file or directory.
==================
WARNING: ThreadSanitizer: data race (pid=22303)
  Write of size 8 at 0x7db000000000 by main thread:
    #0 memcpy /home/buildslave/ikosarev/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:640
...
SUMMARY: ThreadSanitizer: data race ??:0 ??
==================
1417692557.894964692: CollectGarbage()
1417692558.719659730: in BusyThread() loop
1417692558.740268355: in CollectGarbage()
1417692558.740936178: in BusyThread() loop
1417692558.744038178: in StopWorld()
1417692558.744717244: in BusyThread() loop
1417692558.746884000: in StartWorld()
1417692558.749903867: pthread_join()
DONE
ThreadSanitizer: reported 2 warnings

So it looks like we call pthread_kill() for an under-initialized thread. We probably shouldn't quit the pthread_create() interceptor unless the thread's routine took control?

Does the following change fix the issue? With this change all tsan state is initialized when pthread_create returns as far as I see. It would be a much better fix.

Index: rtl/tsan_interceptors.cc

  • rtl/tsan_interceptors.cc (revision 222341)

+++ rtl/tsan_interceptors.cc (working copy)
@@ -882,8 +882,8 @@

ThreadIgnoreEnd(thr, 0);
while ((tid = atomic_load(&p->tid, memory_order_acquire)) == 0)
  pthread_yield();

+ ThreadStart(thr, tid, GetTid());

atomic_store(&p->tid, 0, memory_order_release);
  • ThreadStart(thr, tid, GetTid()); } void *res = callback(param); // Prevent the callback from being tail called,

Awesome patch formatting. Thank you, reviews.llvm.org.

With this change we sometimes pass the test, but in most of the cases we still call CollectGarbage() before control reach BusyThread() and then deadlock.

test/tsan/signal_recursive.cc
67 ↗(On Diff #16866)

My man pages (both the Linux and FreeBSD ones) for sem_wait() read that it returns (-1) on error so I believe it should be:

if (errno != EINTR)
79 ↗(On Diff #16866)

Same here.
It doesn't affect passing the test on the buildbot, but though.

I still don't understand how/why it deadlocks w/o the sleep and how sleep affects it.
With my change, the new thread must be calling the user thread function when pthread_create returns. So basically the new thread is already running from tsan point in view.

OK, the problem is that sometimes StartWorld() does pthread_kill(kSigRestart) before the suspend handler calls sigsuspend(). In fact, it tries to restart the world even before the first sem_post() call returns control. So what we need is some delay between StopWorld() and StartWorld(). I'm going to determine what duration would make the buildbot happy and then prepare an update.

This comment was removed by dvyukov.

A more reliable solution would be to remove sigfillset(&act.sa_mask), and change RestartHandler to:

void RestartHandler() {

sem_post(&g_thread_suspend_ack_sem)
sem_post(&g_thread_suspend_ack_sem)

}

and then do sem_wait(&g_thread_suspend_ack_sem) both in StartWorld and in SuspendHandler.

But I am not sure that it will not deadlock in some other way. So let's try the sleep first.

kutuzov.viktor.84 edited edge metadata.

Updated. The idea behind process_sleep() is to count process time and thus make the test somewhat less dependent on system load. I've put it in a header so we can use it in other tests.

dvyukov added inline comments.Dec 8 2014, 1:49 PM
test/tsan/process_sleep.h
4 ↗(On Diff #17035)

Humm... clock() returns CPU time used by the process.
It looks somewhat dangerous to use in most of the tests. If the process is blocked it does not consume CPU time. Of course, this very loop consumes a dash of CPU time every millisecond... But it's not what we want to wait for.
sleep(1) worked well for us all other tests. I would rather it here as well.

test/tsan/signal_recursive.cc
45 ↗(On Diff #17035)

Please remove this line as well. It not only pointless, but also wrong. If the signal comes before this line, we lose the signal and deadlock.
Maybe it will fix the test w/o sleep...

test/tsan/process_sleep.h
4 ↗(On Diff #17035)

The problem with sleep() is that on systems operating under somewhat stress conditions there is no guarantee the threads are synchronized as we expect it, so even with sleep(4) we still get failures on the buildbot with this test from time to time. Moreover, this test is not the only one that fail only somtimes due to use of sleep(). (However, I'm not sure we are absolutely free of this kind of failures when using the clock() func; we could give it a try and see.) Anyway, if, for whatever reason, this approach is not acceptable, then we probably should state it explicitly that Tsan tests require a dedicated buildbot by their design and nature.

test/tsan/signal_recursive.cc
45 ↗(On Diff #17035)

It didn't worked. In all cases when the signal comes before sigsuspend(), it deadlocks. Sigh... :-)

dvyukov accepted this revision.Dec 11 2014, 7:21 AM
dvyukov edited edge metadata.

LGTM
let's move forward

This revision is now accepted and ready to land.Dec 11 2014, 7:21 AM
This revision was automatically updated to reflect the committed changes.

Committed with the usleep() delay reduced so it take less time to pass on normal run.