Fixes the issue similar to the one mentioned in D6478.
Details
Diff Detail
Event Timeline
Why does tsan deadlock? Does the program actually deadlock in tsan runtime?
Without this sleep we enter CollectGarbage() before BusyThread() is called (the numbers on the left side are what clock_gettime(CLOCK_REALTIME_PRECISE) returns):
1417692217.980580462: Init() 1417692217.987495307: pthread_create() 1417692218.54266293: CollectGarbage() 1417692218.54445716: in CollectGarbage() 1417692218.54574294: in StopWorld() LLVMSymbolizer: error reading file: No such file or directory. ================== WARNING: ThreadSanitizer: data race (pid=23947) Write of size 8 at 0x7db000000000 by thread T1: #0 memcpy /home/buildslave/ikosarev/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:640 ... SUMMARY: ThreadSanitizer: data race ??:0 ?? ================== Stack dump: 0. Program arguments: /home/buildslave/ikosarev/build/./bin/llvm-symbolizer --inlining=true --default-arch=x86_64
With sleep(1) it is:
1417692556.835742780: Init() 1417692556.848583448: pthread_create() 1417692556.896431988: in BusyThread() 1417692556.896830432: in BusyThread() loop 1417692556.897031854: in BusyThread() loop ... 1417692557.894940248: in BusyThread() loop LLVMSymbolizer: error reading file: No such file or directory. ================== WARNING: ThreadSanitizer: data race (pid=22303) Write of size 8 at 0x7db000000000 by main thread: #0 memcpy /home/buildslave/ikosarev/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:640 ... SUMMARY: ThreadSanitizer: data race ??:0 ?? ================== 1417692557.894964692: CollectGarbage() 1417692558.719659730: in BusyThread() loop 1417692558.740268355: in CollectGarbage() 1417692558.740936178: in BusyThread() loop 1417692558.744038178: in StopWorld() 1417692558.744717244: in BusyThread() loop 1417692558.746884000: in StartWorld() 1417692558.749903867: pthread_join() DONE ThreadSanitizer: reported 2 warnings
So it looks like we call pthread_kill() for an under-initialized thread. We probably shouldn't quit the pthread_create() interceptor unless the thread's routine took control?
Does the following change fix the issue? With this change all tsan state is initialized when pthread_create returns as far as I see. It would be a much better fix.
Index: rtl/tsan_interceptors.cc
- rtl/tsan_interceptors.cc (revision 222341)
+++ rtl/tsan_interceptors.cc (working copy)
@@ -882,8 +882,8 @@
ThreadIgnoreEnd(thr, 0); while ((tid = atomic_load(&p->tid, memory_order_acquire)) == 0) pthread_yield();
+ ThreadStart(thr, tid, GetTid());
atomic_store(&p->tid, 0, memory_order_release);
- ThreadStart(thr, tid, GetTid()); } void *res = callback(param); // Prevent the callback from being tail called,
With this change we sometimes pass the test, but in most of the cases we still call CollectGarbage() before control reach BusyThread() and then deadlock.
I still don't understand how/why it deadlocks w/o the sleep and how sleep affects it.
With my change, the new thread must be calling the user thread function when pthread_create returns. So basically the new thread is already running from tsan point in view.
OK, the problem is that sometimes StartWorld() does pthread_kill(kSigRestart) before the suspend handler calls sigsuspend(). In fact, it tries to restart the world even before the first sem_post() call returns control. So what we need is some delay between StopWorld() and StartWorld(). I'm going to determine what duration would make the buildbot happy and then prepare an update.
A more reliable solution would be to remove sigfillset(&act.sa_mask), and change RestartHandler to:
void RestartHandler() {
sem_post(&g_thread_suspend_ack_sem) sem_post(&g_thread_suspend_ack_sem)
}
and then do sem_wait(&g_thread_suspend_ack_sem) both in StartWorld and in SuspendHandler.
But I am not sure that it will not deadlock in some other way. So let's try the sleep first.
Updated. The idea behind process_sleep() is to count process time and thus make the test somewhat less dependent on system load. I've put it in a header so we can use it in other tests.
test/tsan/process_sleep.h | ||
---|---|---|
4 ↗ | (On Diff #17035) | Humm... clock() returns CPU time used by the process. |
test/tsan/signal_recursive.cc | ||
43 | Please remove this line as well. It not only pointless, but also wrong. If the signal comes before this line, we lose the signal and deadlock. |
test/tsan/process_sleep.h | ||
---|---|---|
4 ↗ | (On Diff #17035) | The problem with sleep() is that on systems operating under somewhat stress conditions there is no guarantee the threads are synchronized as we expect it, so even with sleep(4) we still get failures on the buildbot with this test from time to time. Moreover, this test is not the only one that fail only somtimes due to use of sleep(). (However, I'm not sure we are absolutely free of this kind of failures when using the clock() func; we could give it a try and see.) Anyway, if, for whatever reason, this approach is not acceptable, then we probably should state it explicitly that Tsan tests require a dedicated buildbot by their design and nature. |
test/tsan/signal_recursive.cc | ||
---|---|---|
43 | It didn't worked. In all cases when the signal comes before sigsuspend(), it deadlocks. Sigh... :-) |
Committed with the usleep() delay reduced so it take less time to pass on normal run.
Please remove this line as well. It not only pointless, but also wrong. If the signal comes before this line, we lose the signal and deadlock.
Maybe it will fix the test w/o sleep...