This patch addresses https://github.com/google/sanitizers/issues/774
When we fork a multi-threaded process it's possible to deadlock if some thread acquired StackDepot or allocator internal lock just before fork. In this case the lock will never be released in child process causing deadlock on following memory alloc/dealloc routine.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
I'm trying to cook. Need to have concurrent fork with malloc/free and the StackDepot lock to be acquired when fork is called. But the window is too small :(. Can we widen it somehow?
Look at test/msan/fork.cc, it should be possible to replace copy_uninit_thread2 with an allocation loop.
Updating. When adapting test/msan/fork.cc for ASan I've noticed that deadlock can also happen if the whole allocator is locked when we calling fork. Corresponding backtrace looks like this:
#0 atomic_exchange<__sanitizer::atomic_uint32_t> () at /home/max/src/llvm/projects/compiler-rt/lib/sanitizer_common/sanitizer_atomic_clang.h:68 #1 Lock () at /home/max/src/llvm/projects/compiler-rt/lib/sanitizer_common/sanitizer_linux.cc:541 #2 0x000000000041e9f9 in GenericScopedLock () at /home/max/src/llvm/projects/compiler-rt/lib/asan/../sanitizer_common/sanitizer_mutex.h:187 #3 GetFromAllocator () at /home/max/src/llvm/projects/compiler-rt/lib/asan/../sanitizer_common/sanitizer_allocator_primary64.h:134 #4 0x000000000041e99b in Refill () at /home/max/src/llvm/projects/compiler-rt/lib/asan/../sanitizer_common/sanitizer_allocator_local_cache.h:108 #5 0x000000000041e550 in Allocate () at /home/max/src/llvm/projects/compiler-rt/lib/asan/../sanitizer_common/sanitizer_allocator_local_cache.h:50 #6 0x000000000041e443 in Allocate () at /home/max/src/llvm/projects/compiler-rt/lib/asan/../sanitizer_common/sanitizer_allocator_combined.h:68 #7 0x000000000041b6cb in Allocate () at /home/max/src/llvm/projects/compiler-rt/lib/asan/asan_allocator.cc:407 #8 0x00000000004b9148 in __interceptor_malloc () at /home/max/src/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:67 #9 0x00000000004e878d in child () at fork.cc:47 #10 0x00000000004e8968 in test () at fork.cc:67 #11 0x00000000004e8a38 in main () at fork.cc:78
Thus we need to lock allocator as well.
Good catch. This is a problem for MSan as well, is not it? Do you mind fixing it in both tools?
For bonus points, TSan. I see it locks something
void ForkBefore(ThreadState *thr, uptr pc) {
ctx->thread_registry->Lock(); ctx->report_mtx.Lock();
}
But again, not the allocator.
Sure, we can fix it for all sanitizers if the problem exists, I'll check this more carefully. FWIW I've run the attached testcase with MSan and TSan, but it didn't hang (as opposed to ASan). Do they use allocator caches (that can lock on refilling)?
For bonus points, TSan. I see it locks something
void ForkBefore(ThreadState *thr, uptr pc) {ctx->thread_registry->Lock(); ctx->report_mtx.Lock();}
But again, not the allocator.
Updating. Lock allocator on fork for {A, M, T}San and moving testcase to sanitizer_common. I've also managed to reproduce the deadlock with MSan, but still not with TSan.
lib/msan/msan_interceptors.cc | ||
---|---|---|
1231 ↗ | (On Diff #99781) | just a nit: ASan takes these locks in a different order, please be consistent. |
test/sanitizer_common/TestCases/Linux/allocator_fork_no_hang.cc | ||
32 ↗ | (On Diff #99781) | A volatile or -O0 flag would not hurt to make sure this is not optimized out. |
41 ↗ | (On Diff #99781) | What are these hash table cells? |
test/sanitizer_common/TestCases/Linux/allocator_fork_no_hang.cc | ||
---|---|---|
41 ↗ | (On Diff #99781) | Oh, this comment reflects to StackDepotLockAll(). I'll remove it. |
Updating. Reproduced the deadlock with TSan and LSan on adjusted testcase. Adding Dmitry since the patch now affects TSan code as well.
I don't completely understand this. allocator_fork_no_hang.cc is broken and deserves deadlocking. Why are we trying to fix this program only under sanitizers rather than detecting and reporting the bug? Sanitizers are meant to be less forgiving than production environment. Why are we trying to conceal the bug?
Yes, the testcase deserves deadlocking (malloc is not async signal safe). And the actual bug is reported here: https://bugzilla.gnome.org/show_bug.cgi?id=738620.
The motivation for "fixing" this in sanitizers is that most of modern allocators (Glibc, tcmalloc, jemalloc) are actually fork safe, so why not to do the same in sanitizers. And despite the fact that calling malloc after multi-threaded fork is not allowed, many people still would expect their code to work (since it works with Glibc/tcmalloc/jemalloc etc).
lib/tsan/rtl/tsan_dense_alloc.h | ||
---|---|---|
94 ↗ | (On Diff #99890) | This is no more than Lock, please call it Lock. ForceLock produces a false impression that this is something more than just Lock. |
98 ↗ | (On Diff #99890) | Same here. |
lib/tsan/rtl/tsan_rtl.cc | ||
458 ↗ | (On Diff #99890) | Also StackDepotLockAll. |
lib/tsan/rtl/tsan_sync.h | ||
128 ↗ | (On Diff #99890) | Same here. |
130 ↗ | (On Diff #99890) | Same here. |
lib/tsan/rtl/tsan_rtl.cc | ||
---|---|---|
460 ↗ | (On Diff #99924) | Now looking at this more closely and paging in some context. |
Removing TSan part. The attached testcase still hangs with TSan, but since locking allocator is problematic let's leave TSan alone.
This change is causing problems on Linux. The immediate issue is Tcl, which creates new threads from a pthread_atfork() child handler. The handler is called from libc's fork(), and pthread_create tries to acquire the StackDepot lock before it is released in the fork interceptor.
Tcl is broken, of course:
If a fork() call in a multi-threaded process leads to a child fork handler calling any function that is not async-signal-safe, the behavior is undefined.
But in general, even malloc() called from pthread_atfork() will deadlock with this change.
It looks like the right way to do this is call pthread_atfork() as early as possible to setup lock/unlock handlers for the allocator and the stack depot.
I'll revert this change in the meantime.
Oh, sorry.
Won't this deadlock with MSan as well? AFAIK it also locks StackDepot in fork interceptor...
Tcl is broken, of course:
If a fork() call in a multi-threaded process leads to a child fork handler calling any function that is not async-signal-safe, the behavior is undefined.But in general, even malloc() called from pthread_atfork() will deadlock with this change.
It looks like the right way to do this is call pthread_atfork() as early as possible to setup lock/unlock handlers for the allocator and the stack depot.
Perhaps we can do it from asan_init? From man:
"The order of calls to pthread_atfork() is significant. The parent and child fork handlers shall be called in the order in which they were established by calls to pthread_atfork(). The prepare fork handlers shall be called in the opposite order."
So in this case our unlock routines will be called before others. This still can deadlock when preloading asan to noinstrumented binary though.
I'll revert this change in the meantime.
Maybe it does not use StackDepot in pthread_create?
Tcl is broken, of course:
If a fork() call in a multi-threaded process leads to a child fork handler calling any function that is not async-signal-safe, the behavior is undefined.But in general, even malloc() called from pthread_atfork() will deadlock with this change.
It looks like the right way to do this is call pthread_atfork() as early as possible to setup lock/unlock handlers for the allocator and the stack depot.Perhaps we can do it from asan_init? From man:
"The order of calls to pthread_atfork() is significant. The parent and child fork handlers shall be called in the order in which they were established by calls to pthread_atfork(). The prepare fork handlers shall be called in the opposite order."So in this case our unlock routines will be called before others. This still can deadlock when preloading asan to noinstrumented binary though.
Sounds like that's the best we can do. If we are lucky, sanitizer initialization will happen early enough due to ENSURE_ASAN_INIT in one of the interceptors.
I'll revert this change in the meantime.
Hm, perhaps we can just intercept pthread_atfork and force ENSURE_ASAN_INITED? And install allocator hooks in AsanInitFromRtl?
I'll revert this change in the meantime.