
Fiber support for thread sanitizer
Closed, Public

Authored by yuri on Nov 26 2018, 2:07 AM.

Details

Reviewers
dvyukov
blastrock
Summary

This patch adds functions for managing fibers:

  • __tsan_get_current_fiber()
  • __tsan_create_fiber()
  • __tsan_destroy_fiber()
  • __tsan_switch_to_fiber()
  • __tsan_set_fiber_name()
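For orientation, this is roughly how a ucontext-based coroutine library would drive these entry points, assuming the signatures proposed here (void *__tsan_create_fiber(unsigned flags), void __tsan_switch_to_fiber(void *fiber, unsigned flags), etc.); the Coro type and helper functions are made up for the sketch:

#include <sanitizer/tsan_interface.h>
#include <ucontext.h>
#include <cstddef>

struct Coro {
  ucontext_t ctx;      // the coroutine's machine context
  ucontext_t caller;   // context of whoever resumed it
  void *tsan_fiber;    // TSan's per-fiber handle
  void *caller_fiber;  // TSan handle of the resuming thread
};

void coro_create(Coro *c, void (*fn)(), void *stack, size_t stack_size) {
  getcontext(&c->ctx);
  c->ctx.uc_stack.ss_sp = stack;
  c->ctx.uc_stack.ss_size = stack_size;
  c->ctx.uc_link = &c->caller;                 // where to go if fn ever returns
  makecontext(&c->ctx, fn, 0);
  c->tsan_fiber = __tsan_create_fiber(0);      // register the fiber with TSan
  __tsan_set_fiber_name(c->tsan_fiber, "coro");
}

void coro_resume(Coro *c) {
  c->caller_fiber = __tsan_get_current_fiber();
  __tsan_switch_to_fiber(c->tsan_fiber, 0);    // announce the switch right before it happens
  swapcontext(&c->caller, &c->ctx);
}

void coro_yield(Coro *c) {                     // called from inside the coroutine
  __tsan_switch_to_fiber(c->caller_fiber, 0);  // announce the switch back as well
  swapcontext(&c->ctx, &c->caller);
}

void coro_destroy(Coro *c) {
  __tsan_destroy_fiber(c->tsan_fiber);         // free TSan's per-fiber state
}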

Diff Detail

Event Timeline

dvyukov added inline comments.Dec 11 2018, 6:28 AM
lib/tsan/rtl/tsan_rtl_thread.cc
249–250

Let's not abuse an existing variable for this. I am not sure that no OS passes 0 here today, or that nobody will decide in the future to pass 0 because they don't have anything meaningful to pass as os_id. Let's add another flag for this.

741g added a comment.Dec 11 2018, 9:46 AM

Lingfeng, thanks for the extensive feedback.

This should allow that logic to work for QEMU at least, and it feels like it should work for all possible other uses (?)

I am not sure there are no programs that use setjmp/longjmp to switch fibers across threads. Yes, the man page says the behavior is undefined. But strictly speaking, longjmp shouldn't be used for fiber switching either (it's solely for jumping upwards on the same stack). So I feel people are just relying on implementation details here (let's test it, it works, ok, let's use it).

You said that QEMU uses sigsetjmp/siglongjmp for coroutine switches. How does a coroutine always stay on the same thread then? Say, if they have a thread pool servicing all coroutines, I would expect that coroutines are randomly intermixed across threads. So it can happen that we setjmp on one thread, but then reschedule the coroutine with longjmp on another thread.

Thanks for pointing that out. I have to check more closely, but I haven't observed migration of fibers across physical threads at all in QEMU. The tendency is to both create and switch the coroutine on a dedicated thread. The swapcontext is done to a trampoline function, to which one can return again later on. It is kind of like a fiber pool, where reuse of a fiber means longjmping to the trampoline instead of another swapcontext.

I don't think there is a thread pool that mixes the fibers around. I guess this means we would have to annotate in the more general use case, but for QEMU, if we make the assumption that setjmp/longjmp follow the spec, while only allowing physical thread switches in swapcontext, we should be OK (and physical thread switching in swapcontext is something QEMU does not do).

Makes sense.

If somebody wants to use longjmp to switch fibers across threads, we will need some additional support in tsan for that. Let's stick with Yuri's and your use cases for now.

yuri updated this revision to Diff 178459.Dec 17 2018, 7:10 AM
  • By default sync fibers with each other on switch
  • Faster cur_thread(), no visible difference in benchmark results
  • Moved implementation to tsan_rtl_thread.cc
yuri added a comment.Dec 17 2018, 7:15 AM

Hi Dmitry,
I will try to answer your questions

  1. Do we want to synchronize fibers on switch? If fibers use TLS data, or cooperative scheduling (e.g. mutex unlock directly schedules the next critical section owner), or pass some data from the previous fiber to the next fiber, in all these cases not synchronizing fibers on switch will lead to false positives. As far as I remember, in my old experiments I concluded that it's almost impossible to get rid of false positives if tasks/fibers are not synchronized on switch.

In the case of a single-threaded scheduler, it is worth synchronizing fibers on each switch. Currently I do it in client code using __tsan_release()/__tsan_acquire(), but it is possible to add a flag for __tsan_switch_to_fiber() that will do it.

In the case of a multithreaded scheduler, the client probably has its own synchronization primitives (for example, a custom mutex), and it is the client's responsibility to call the corresponding tsan functions.
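As a concrete illustration of the manual scheme described above (a sketch only: __tsan_release/__tsan_acquire are the annotations mentioned, everything else is hypothetical):

// Single-threaded scheduler synchronizing fibers by hand on each switch.
struct FiberCtx {
  ucontext_t ctx;
  void *tsan_fiber;                            // handle from __tsan_create_fiber()
};

static char g_switch_sync;                     // any stable address works as a sync object

void manual_switch(FiberCtx *from, FiberCtx *to) {
  __tsan_release(&g_switch_sync);              // publish what the current fiber did
  __tsan_switch_to_fiber(to->tsan_fiber, 0);   // tell TSan which fiber runs next
  swapcontext(&from->ctx, &to->ctx);           // the actual switch
  __tsan_acquire(&g_switch_sync);              // executed when 'from' is resumed later
}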

I think that synchronization will be exactly in the wrong place.
Consider: a thread pushes a fiber for execution on a thread pool. It locks a mutex and adds the fiber onto the run queue. Then a thread pool thread dequeues the fiber from the run queue under the same mutex. Now the first thread and the thread pool thread are synchronized. Now the thread pool switches to the fiber context; no synchronization happens, so the fiber is _not_ synchronized with the submitting thread. This means that any passed data will cause false positives.
Even if the fiber is just created, in the above scheme even reading the fiber entry function arguments (and even just writing to the stack) will cause false positives.

If the fiber switch synchronizes the parent thread and the fiber, then we get the desired transitive synchronization.

Looking at the race reports produced by the QEMU prototype, it seems there are actually all kinds of these false reports. There are false races on TLS storage. There are races on the fiber start function. There are numerous races on arguments passed to fibers.

You said that for your case you need the synchronization, and it seems that it's needed for all other cases too. I think we need to do it. People will get annotations wrong, or won't think about acquire/release at all, and then come to the mailing list with questions about all these races. If it proves to be needed, we can always add an opt-out of synchronization with a new flag for the annotations.

I added default synchronization and a flag to opt out. It would be great if someone could check this version with QEMU.
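With synchronization now the default, the manual release/acquire pair above becomes unnecessary; a caller that wants the old non-synchronizing behaviour would pass the opt-out flag instead (the constant name below is illustrative, standing in for whatever the patch exposes):

// Default: establish happens-before between the previous and the next fiber.
__tsan_switch_to_fiber(fiber, 0);

// Opt out of the implicit synchronization, e.g. to catch races between fibers
// that only happen to share a thread by chance (flag name is illustrative).
__tsan_switch_to_fiber(fiber, __tsan_switch_to_fiber_no_sync);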

yuri added a comment.Dec 17 2018, 7:22 AM
  1. This introduces a slowdown into the hottest runtime functions (a branch and an indirection in cur_thread). I think the check_analyze.sh test will fail. It's kinda hard to say, because it's already broken at the moment, but this perturbs the runtime functions enough:

I have checked parts of the patch separately and found:

  • The part of the patch related to HACKY_CALL does not affect the code of critical functions or the benchmark results
  • The part of the patch related to proc() does not affect the code of critical functions or the benchmark results
  • The only part that affects critical functions is the implementation of cur_thread(). I tried to minimize its effect in the new version of the patch.

Yes, cur_thread() is the problem. It's used in all the hottest functions.
As far as I see the current code still contains an indirection in cur_thread(). We need to think of a way to not affect hot paths.

The new version should be as fast as the original code.

  1. Looking at 1, 2, 3, 7 and 8, it seems that we may want to switch some smaller part (subset) of ThreadState. E.g. introduce a TaskContext that contains only the part that we need to switch on fiber switches, and then have: ThreadState -> TaskContext (of the physical thread, always linked) -> TaskContext (current fiber context), or: ThreadState -> TaskContext (of the physical thread, always linked) -> TaskContext (of a task) -> TaskContext (of a nested task). E.g. the clock can stay in ThreadState (if we want to synchronize them), which will solve the performance problem, and jmp_bufs can stay, which will solve part of the setjmp problem. This may resolve the hacky_call problem as well.

IMHO it is hardly possible: function entry/exit use the shadow stack, memory access functions use the clock. Both structures are too big to be copied on each fiber switch.

I prefer not to do it:

  • There is already a similar separation, ThreadState/Process. If something is not needed in ThreadState, it can be moved to Process.

You mean Processor?
Processor is a completely different thing: it has no relation to user-visible state; it's merely a helper and an implementation detail that makes the internal machinery more efficient. ThreadState is a user-visible entity. Nothing can be moved from ThreadState to Processor (except for stuff that is meant to be in Processor but wasn't moved with the introduction of Processor; that's a different story, it does not relate to what we discuss here).

  • In the case of a single-threaded scheduler the main thing to switch is the shadow stack. In the case of a multithreaded scheduler nearly everything should be switched. I do not think it is a good idea to have two different implementations for these cases.

Not if we always synchronize on fiber switch.
If we synchronize, then the clock does not need to be switched. And the clock is exactly the thing that we need to embed in TLS to not slow down hot paths.

  • I tried to be on par with the Go sanitizer, which uses a ThreadState for each goroutine and a single Processor per physical thread

Go is different. Processor is not user visible, ThreadState is.
The Go model is exactly the same as what we have for C/C++ without this change: both have only one user-visible entity, the thread. With this change C/C++ has two user-visible entities: thread and fiber.

Can't agree with this. Only the fiber is visible to the user. Of course, it is the same as the thread until the user switches it.

yuri added a comment.EditedDec 17 2018, 7:38 AM
  1. I think instead of __tsan_get_current_fiber we could use the pointer 0. It prevents things like switching to the context of another thread, which won't end well. And it just simplifies the API surface.

For us, the possibility to use the context of another thread is a required feature. We use it to call into a library which uses fibers from external fiber-unaware code.
It is on par with the Windows API for fibers, ConvertThreadToFiber().

Tell me more.
You have a context for all fibers; __tsan_create_fiber returns it.
__tsan_get_current_fiber is only useful for the physical thread. Do you use __tsan_switch_to_fiber to switch from a fiber to a non-current physical thread? I don't see how this can work. Or what do you do?

In the standalone thread we create a temporary context and switch into it. Then the original thread context is posted into the event-loop thread, where it runs for some time together with other fibers. At some point the code decides that it should migrate to the original thread. The context is removed from the list of fibers and the original thread (running the temporary context) is signalled. The thread switches into its own context and destroys the temporary context.

This trick significantly simplifies the implementation of bindings for a library that uses fibers internally and improves debuggability because stack traces are preserved after the switch.
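A rough sketch of that scheme, with every helper name hypothetical and the event-loop/signalling plumbing elided:

// Calling a fiber-based library from a fiber-unaware thread (sketch only).
void run_via_event_loop_and_return() {
  void *orig_fiber = __tsan_get_current_fiber();  // this thread's own context
  void *tmp_fiber = __tsan_create_fiber(0);       // temporary context to park on

  post_context_to_event_loop(orig_fiber);         // hypothetical: the original context
                                                  // now runs as one of the fibers there
  __tsan_switch_to_fiber(tmp_fiber, 0);           // this thread keeps running on the
  switch_to_temporary_stack();                    // temporary context (hypothetical switch)

  wait_until_signalled();                         // the event loop migrated the work back
  __tsan_switch_to_fiber(orig_fiber, 0);          // switch back onto our own context
  switch_back_from_temporary_stack();             // hypothetical
  __tsan_destroy_fiber(tmp_fiber);
}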

  1. What if a fiber calls pthread_exit?

I think it's enough to add a check and abort until someone really needs support for it.

Let's add a check and a test then. These things tend to lead to very obscure failure modes that take days to debug on large user programs.

Currently there is no interceptor for pthread_exit(). Do you suggest adding one?

  1. Can we combine __tsan_set_fiber_name with __tsan_create_fiber?

That would not cover our requirements. It is convenient to set the fiber name from the fiber itself and even change it sometimes, analogously to pthread_setname_np().

I see. Let's leave it as a separate function. I guess it's more flexible and will cover all possible use cases.
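For example, a fiber can name (or later rename) itself, analogous to pthread_setname_np():

// The name is attached to whatever fiber is currently running.
__tsan_set_fiber_name(__tsan_get_current_fiber(), "http-worker");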

  1. Do we want to support nested fibers/tasks? Does GCD have nested tasks? I think TBB/OpenMP does. Won't begin/end fiber/task be a more generic API for this? E.g. fiber switching can be expressed as stopping the previous fiber (which switches back to the physical thread context), then starting the new fiber.

GCD has no nested tasks. GCD tasks can start other tasks, but can't switch or wait.

For me the current API looks more generic than what you suggest.
Also, switching to the physical thread context may be problematic in some cases (see the answer to question 3).

Let's leave tasks aside for now. It's quite complex without tasks already.

  1. How does this play with setjmp/longjmp? In particular with fiber switching based on setjmp/longjmp? jmp_bufs is part of ThreadState; won't we always try to do longjmp from the wrong ThreadState in such a case?

My idea is that when setjmp/longjmp is used for a fiber switch, the client should disable interceptors before the call and enable them after it.

What do you mean by "disable interceptors"? Do we have such a thing?

This procedure can be simplified by a flag of __tsan_switch_to_fiber().

How? setjmp/longjmp happen outside of __tsan_switch_to_fiber.

Isn't it enough to place __tsan_ignore_thread_begin()/__tsan_ignore_thread_end() around the setjmp/longjmp calls?

dvyukov added inline comments.Dec 18 2018, 7:18 AM
lib/tsan/rtl/tsan_rtl_thread.cc
249–250

Still holds.

Can't agree with this. Only the fiber is visible to the user. Of course, it is the same as the thread until the user switches it.

Whatever we call them, Processor is for different things. It must not hold anything related to user state.

In the standalone thread we create a temporary context and switch into it. Then the original thread context is posted into the event-loop thread, where it runs for some time together with other fibers.

Oh, I see. I guess it can have some rough edges if used incorrectly, but also probably can work if used carefully.
OK, let's leave __tsan_get_current_fiber alone.

Currently there is no interceptor for pthread_exit(). Do you suggest adding one?

Yes.

Isn't it enough to place __tsan_ignore_thread_begin()/__tsan_ignore_thread_end() around the setjmp/longjmp calls?

These only ignore memory accesses, they don't affect interceptor operation as far as I see.

I don't completely follow the logic behind cur_thread/cur_thread_fast/cur_thread1 and how it does not introduce a slowdown.
cur_thread_fast still includes an indirection, so it looks exactly the same as the first version.
cur_thread_fast does not check for NULL, so I would expect it to be used only in a few carefully chosen functions. But it's used in most entry functions. Which makes me wonder: when can cur_thread_fast not be used, if it can be used in all these places?
It seems that you use cur_thread only to lazily initialize cur_thread1 in Initialize or ThreadStart. But then the question is: why don't we explicitly initialize cur_thread1 in these functions, and then just have a single cur_thread function as before, with no two subtly different accessors?
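For readers following along, the accessor split under discussion looks roughly like this (a simplified sketch, not the actual patch; THREADLOCAL and INLINE are the usual sanitizer_common macros):

// cur_thread1 is a TLS pointer that a fiber switch would repoint at the
// ThreadState of the active fiber; the placeholder backs the thread itself.
extern THREADLOCAL char cur_thread_placeholder[];
extern THREADLOCAL ThreadState *cur_thread1;

INLINE ThreadState *cur_thread() {       // slow path: lazily set up the pointer
  if (cur_thread1 == nullptr)
    cur_thread1 = reinterpret_cast<ThreadState *>(cur_thread_placeholder);
  return cur_thread1;
}

INLINE ThreadState *cur_thread_fast() {  // hot path: one extra load, no NULL check
  return cur_thread1;
}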

dvyukov added inline comments.Dec 18 2018, 7:32 AM
lib/tsan/rtl/tsan_rtl.h
886

It seems it should be SwtchFiberFlags flags. Or why do we need SwtchFiberFlags?

lib/tsan/rtl/tsan_rtl_thread.cc
417

This code is an abstract core and must not use cur_thread.
Pass ThreadState *thr, uptr pc as first args to all these functions as all runtime functions do.

I added default synchronization and a flag to opt out.

Can you switch to it in your codebase? I would expect that you need all (or almost all) of this synchronization anyway?

In the standalone thread we create a temporary context and switch into it. Then the original thread context is posted into the event-loop thread, where it runs for some time together with other fibers.

Please add such a test (or we will break it in the future).

IMHO it is hardly possible: function entry/exit use the shadow stack, memory access functions use the clock. Both structures are too big to be copied on each fiber switch.

This is a good question.
The shadow stack seems to use an indirection already.
Clock copying is a sequential memory copy. If the handling of memory accesses/function entry/exit becomes a bit slower, then a thousand memory accesses can outweigh the cost of copying the clock.
A good question is: how much slower is the code with the indirection in cur_thread?

FTR, here is the current code:

00000000004b2c50 <__tsan_read2>:
  4b2c50:       48 b8 f8 ff ff ff ff    movabs $0xffff87fffffffff8,%rax
  4b2c57:       87 ff ff 
  4b2c5a:       48 ba 00 00 00 00 00    movabs $0x40000000000,%rdx
  4b2c61:       04 00 00 
  4b2c64:       53                      push   %rbx
  4b2c65:       48 21 f8                and    %rdi,%rax
  4b2c68:       48 8b 74 24 08          mov    0x8(%rsp),%rsi
  4b2c6d:       48 31 d0                xor    %rdx,%rax
  4b2c70:       48 83 3c 85 00 00 00    cmpq   $0xffffffffffffffff,0x0(,%rax,4)
  4b2c77:       00 ff 
  4b2c79:       0f 84 9d 00 00 00       je       4b2d1c <__tsan_read2+0xcc>
  4b2c7f:       49 c7 c0 c0 04 fc ff    mov    $0xfffffffffffc04c0,%r8
  4b2c86:       64 49 8b 08             mov    %fs:(%r8),%rcx
  4b2c8a:       48 85 c9                test   %rcx,%rcx
  4b2c8d:       0f 88 89 00 00 00       js       4b2d1c <__tsan_read2+0xcc>
...
  4b2cf6:       64 f3 41 0f 7e 50 08    movq   %fs:0x8(%r8),%xmm2
...
  4b2d36:       64 49 89 10             mov    %rdx,%fs:(%r8)

Here is the new code:

00000000004b8460 <__tsan_read2>:
  4b8460:       48 b8 f8 ff ff ff ff    movabs $0xffff87fffffffff8,%rax
  4b8467:       87 ff ff 
  4b846a:       48 ba 00 00 00 00 00    movabs $0x40000000000,%rdx
  4b8471:       04 00 00 
  4b8474:       53                      push   %rbx
  4b8475:       48 21 f8                and    %rdi,%rax
  4b8478:       48 8b 74 24 08          mov    0x8(%rsp),%rsi
  4b847d:       48 31 d0                xor    %rdx,%rax
  4b8480:       48 83 3c 85 00 00 00    cmpq   $0xffffffffffffffff,0x0(,%rax,4)
  4b8487:       00 ff 
  4b8489:       0f 84 9f 00 00 00       je       4b852e <__tsan_read2+0xce>
  4b848f:       48 c7 c2 f8 ff ff ff    mov    $0xfffffffffffffff8,%rdx
  4b8496:       64 4c 8b 02             mov    %fs:(%rdx),%r8
  4b849a:       49 8b 08                mov    (%r8),%rcx
  4b849d:       48 85 c9                test   %rcx,%rcx
  4b84a0:       0f 88 88 00 00 00       js       4b852e <__tsan_read2+0xce>
...
  4b8509:       f3 41 0f 7e 50 08       movq   0x8(%r8),%xmm2
...
  4b8546:       49 89 10                mov    %rdx,(%r8)

The additional indirection is "mov (%r8),%rcx".

741g added a comment.Dec 18 2018, 8:25 AM

https://android-review.googlesource.com/c/platform/external/qemu/+/844675

Latest version of patch doesn't work with QEMU anymore, at least with those annotations. Error log:

qemu_coroutine_new:181:0x7f02938cf700:0x7f029388f7c0 Start new coroutine
on_new_fiber:94:0x7f02938cf700:0x7f029388f7c0 Create new TSAN co fiber. co: 0x7b4400025940 co fiber: 0x7f0292cc0008 caller fiber: 0x7f029388f7c0
start_switch_fiber:133:0x7f02938cf700:0x7f029388f7c0 Current fiber: 0x7f029388f7c0.
start_switch_fiber:135:0x7f02938cf700:0x7f029388f7c0 Switch to fiber 0x7f0292cc0008
coroutine_trampoline:142:0x7f02938cf700:0x7f0292cc0008 Start trampoline
coroutine_trampoline:157:0x7f02938cf700:0x7f0292cc0008 Current fiber: 0x7f0292cc0008. Set co 0x7b4400025940 to env 0x7b44000259a8
start_switch_fiber:133:0x7f02938cf700:0x7f0292cc0008 Current fiber: 0x7f0292cc0008.
start_switch_fiber:135:0x7f02938cf700:0x7f0292cc0008 Switch to fiber 0x7f029388f7c0
coroutine_trampoline:164:0x7f02938cf700:0x7f029388f7c0 Jump to co 0x7b4400025940 caller fiber 0x7f029388f7c0 env 0x7b4400025940
FATAL: ThreadSanitizer CHECK failed: /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:531 "((thr->shadow_stack_pos)) >= ((buf->shadow_stack_pos))" (0x60000bbe0078, 0x60000bbe0080)

#0 __tsan::TsanCheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_rtl_report.cc:48:25 (qemu-system-x86_64+0x536168)
#1 __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/sanitizer_common/sanitizer_termination.cc:79:5 (qemu-system-x86_64+0x4bcf4f)
#2 LongJmp(__tsan::ThreadState*, unsigned long*) /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:531:7 (qemu-system-x86_64+0x4d2311)
#3 siglongjmp /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:647:3 (qemu-system-x86_64+0x4d23ce)
#4 coroutine_trampoline /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/coroutine-ucontext.c:165:9 (qemu-system-x86_64+0xb72000)
#5 <null> <null> (libc.so.6+0x43fcf)

Do you know what I'm doing wrong here?

yuri added a comment.Dec 18 2018, 8:41 AM

https://android-review.googlesource.com/c/platform/external/qemu/+/844675

Latest version of patch doesn't work with QEMU anymore, at least with those annotations. Error log:

Do you know what I'm doing wrong here?


Please check if adding __attribute__((always_inline)) for start_switch_fiber() and finish_switch_fiber() helps
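That is, roughly the following in QEMU's coroutine-ucontext.c, so that the annotation helpers do not add a stack frame of their own around the sigsetjmp/siglongjmp pair (a sketch; the real helpers take more parameters):

static inline __attribute__((always_inline))
void start_switch_fiber(void *new_fiber)
{
    /* Sketch only: announce the fiber switch to TSan right before the
     * actual context switch happens in the (inlined) caller. */
    __tsan_switch_to_fiber(new_fiber, 0);
}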

741g added a comment.Dec 18 2018, 8:50 AM

https://android-review.googlesource.com/c/platform/external/qemu/+/844675

Latest version of patch doesn't work with QEMU anymore, at least with those annotations. Error log:

Do you know what I'm doing wrong here?


Please check if adding __attribute__((always_inline)) for start_switch_fiber() and finish_switch_fiber() helps

That worked, thanks! Sorry, I had updated the patch with a refactoring in the meantime that probably caused the break.

741g added a comment.Dec 19 2018, 1:07 PM

QEMU status: fewer false positives now; there are many warnings that seem more real, mostly about non-atomic reads of variables that were atomically updated. There might be other issues as well.

741g added a comment.Dec 19 2018, 1:23 PM

WARNING: ThreadSanitizer: data race (pid=3742)
Atomic write of size 1 at 0x7b0c00051620 by thread T14 (mutexes: write M1097):
  #0 __tsan_atomic8_exchange /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interface_atomic.cc:579:3 (qemu-system-x86_64+0x51d0a8)
  #1 qemu_bh_schedule /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/async.c:167:9 (qemu-system-x86_64+0xb4675e)
  #2 worker_thread /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/thread-pool.c:114:9 (qemu-system-x86_64+0xb4838c)
  #3 qemu_thread_trampoline /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/qemu-thread-posix.c:551:17 (qemu-system-x86_64+0xb4fe36)
Previous read of size 1 at 0x7b0c00051620 by thread T10 (mutexes: write M985):
  #0 aio_compute_timeout /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/async.c:198:17 (qemu-system-x86_64+0xb46868)
  #1 aio_poll /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/aio-posix.c:617:26 (qemu-system-x86_64+0xb4c38c)
  #2 blk_prw /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1247:9 (qemu-system-x86_64+0x98bbbd)
  #3 blk_pread /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1409:15 (qemu-system-x86_64+0x98b91a)
  #4 find_image_format /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:701:11 (qemu-system-x86_64+0x958068)
  #5 bdrv_open_inherit /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:2690 (qemu-system-x86_64+0x958068)
  #6 bdrv_open /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:2802:12 (qemu-system-x86_64+0x958d1f)
  #7 blk_new_open /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:375:10 (qemu-system-x86_64+0x98927a)
  #8 blockdev_init /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../blockdev.c:598:15 (qemu-system-x86_64+0x9c5700)
  #9 drive_new /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../blockdev.c:1092 (qemu-system-x86_64+0x9c5700)
  #10 drive_init(void*, QemuOpts*, Error**) /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/drive-share.cpp:366:28 (qemu-system-x86_64+0xb1e9a2)
  #11 qemu_opts_foreach /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/qemu-option.c:1106:14 (qemu-system-x86_64+0xb696e5)
  #12 android_drive_share_init /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/drive-share.cpp:530:9 (qemu-system-x86_64+0xb1e1e0)
  #13 main_impl /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:5244:13 (qemu-system-x86_64+0x54f811)
  #14 run_qemu_main /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:3349:21 (qemu-system-x86_64+0x547a02)
  #15 enter_qemu_main_loop(int, char**) /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/main.cpp:606:5 (qemu-system-x86_64+0x54427c)
  #16 MainLoopThread::run() /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android/android-emu/android/skin/qt/emulator-qt-window.h:73:13 (qemu-system-x86_64+0xc40b41)
  #17 QThreadPrivate::start(void*) /usr/local/google/home/joshuaduong/qt-build/src/qt-everywhere-src-5.11.1/qtbase/src/corelib/thread/qthread_unix.cpp:367:14 (libQt5Core.so.5+0xa7d35)

Location is heap block of size 40 at 0x7b0c00051600 allocated by thread T13:
  #0 malloc /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:667:5 (qemu-system-x86_64+0x4d246c)
  #1 g_malloc /tmp/jansene-build-temp-193494/src/glib-2.38.2/glib/gmem.c:104 (qemu-system-x86_64+0xecfdc0)
  #2 thread_pool_init_one /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/thread-pool.c:307:27 (qemu-system-x86_64+0xb47a63)
  #3 thread_pool_new /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/thread-pool.c:327 (qemu-system-x86_64+0xb47a63)
  #4 aio_get_thread_pool /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/async.c:320:28 (qemu-system-x86_64+0xb46934)
  #5 paio_submit_co /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/file-posix.c:1565:12 (qemu-system-x86_64+0xa15d23)
  #6 raw_co_prw /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/file-posix.c:1620 (qemu-system-x86_64+0xa15d23)
  #7 raw_co_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/file-posix.c:1627 (qemu-system-x86_64+0xa15d23)
  #8 bdrv_driver_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/io.c:924:16 (qemu-system-x86_64+0x8d91f7)
  #9 bdrv_aligned_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/io.c:1228 (qemu-system-x86_64+0x8d91f7)
  #10 bdrv_co_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/io.c:1324:11 (qemu-system-x86_64+0x8d8dd7)
  #11 blk_co_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1158:11 (qemu-system-x86_64+0x98b0da)
  #12 blk_read_entry /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1206:17 (qemu-system-x86_64+0x98c1e3)
  #13 coroutine_trampoline /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/coroutine-ucontext.c:173:9 (qemu-system-x86_64+0xb71c8a)
  #14 <null> <null> (libc.so.6+0x43fcf)

Mutex M1097 (0x7b3800009bd0) created at:
  #0 pthread_mutex_init /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:1187:3 (qemu-system-x86_64+0x4d4ebc)
  #1 qemu_mutex_init /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/qemu-thread-posix.c:61:11 (qemu-system-x86_64+0xb4f2a7)
  #2 thread_pool_init_one /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/thread-pool.c:308:5 (qemu-system-x86_64+0xb47a7b)
  #3 thread_pool_new /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/thread-pool.c:327 (qemu-system-x86_64+0xb47a7b)
  #4 aio_get_thread_pool /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/async.c:320:28 (qemu-system-x86_64+0xb46934)
  #5 paio_submit_co /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/file-posix.c:1565:12 (qemu-system-x86_64+0xa15d23)
  #6 raw_co_prw /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/file-posix.c:1620 (qemu-system-x86_64+0xa15d23)
  #7 raw_co_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/file-posix.c:1627 (qemu-system-x86_64+0xa15d23)
  #8 bdrv_driver_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/io.c:924:16 (qemu-system-x86_64+0x8d91f7)
  #9 bdrv_aligned_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/io.c:1228 (qemu-system-x86_64+0x8d91f7)
  #10 bdrv_co_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/io.c:1324:11 (qemu-system-x86_64+0x8d8dd7)
  #11 blk_co_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1158:11 (qemu-system-x86_64+0x98b0da)
  #12 blk_read_entry /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1206:17 (qemu-system-x86_64+0x98c1e3)
  #13 coroutine_trampoline /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/coroutine-ucontext.c:173:9 (qemu-system-x86_64+0xb71c8a)
  #14 <null> <null> (libc.so.6+0x43fcf)

Mutex M985 (0x000003bb0fb8) created at:
  #0 pthread_mutex_init /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:1187:3 (qemu-system-x86_64+0x4d4ebc)
  #1 qemu_mutex_init /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/qemu-thread-posix.c:61:11 (qemu-system-x86_64+0xb4f2a7)
  #2 qemu_init_cpu_loop /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../cpus.c:1123:5 (qemu-system-x86_64+0x5ddb0c)
  #3 main_impl /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:3437:5 (qemu-system-x86_64+0x547ae0)
  #4 run_qemu_main /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:3349:21 (qemu-system-x86_64+0x547a02)
  #5 enter_qemu_main_loop(int, char**) /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/main.cpp:606:5 (qemu-system-x86_64+0x54427c)
  #6 MainLoopThread::run() /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android/android-emu/android/skin/qt/emulator-qt-window.h:73:13 (qemu-system-x86_64+0xc40b41)
  #7 QThreadPrivate::start(void*) /usr/local/google/home/joshuaduong/qt-build/src/qt-everywhere-src-5.11.1/qtbase/src/corelib/thread/qthread_unix.cpp:367:14 (libQt5Core.so.5+0xa7d35)

Thread T14 (tid=3886, running) created by thread T10 at:
  #0 pthread_create /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:968:3 (qemu-system-x86_64+0x4d3cda)
  #1 qemu_thread_create /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/qemu-thread-posix.c:591:11 (qemu-system-x86_64+0xb4fcef)
  #2 do_spawn_thread /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/thread-pool.c:135:5 (qemu-system-x86_64+0xb48072)
  #3 spawn_thread_bh_fn /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/thread-pool.c:143 (qemu-system-x86_64+0xb48072)
  #4 aio_bh_call /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/async.c:90:5 (qemu-system-x86_64+0xb46616)
  #5 aio_bh_poll /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/async.c:118 (qemu-system-x86_64+0xb46616)
  #6 aio_poll /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/aio-posix.c:706:17 (qemu-system-x86_64+0xb4cf45)
  #7 blk_prw /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1247:9 (qemu-system-x86_64+0x98bbbd)
  #8 blk_pread /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1409:15 (qemu-system-x86_64+0x98b91a)
  #9 find_image_format /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:701:11 (qemu-system-x86_64+0x958068)
  #10 bdrv_open_inherit /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:2690 (qemu-system-x86_64+0x958068)
  #11 bdrv_open /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:2802:12 (qemu-system-x86_64+0x958d1f)
  #12 blk_new_open /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:375:10 (qemu-system-x86_64+0x98927a)
  #13 blockdev_init /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../blockdev.c:598:15 (qemu-system-x86_64+0x9c5700)
  #14 drive_new /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../blockdev.c:1092 (qemu-system-x86_64+0x9c5700)
  #15 drive_init(void*, QemuOpts*, Error**) /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/drive-share.cpp:366:28 (qemu-system-x86_64+0xb1e9a2)
  #16 qemu_opts_foreach /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/qemu-option.c:1106:14 (qemu-system-x86_64+0xb696e5)
  #17 android_drive_share_init /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/drive-share.cpp:530:9 (qemu-system-x86_64+0xb1e1e0)
  #18 main_impl /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:5244:13 (qemu-system-x86_64+0x54f811)
  #19 run_qemu_main /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:3349:21 (qemu-system-x86_64+0x547a02)
  #20 enter_qemu_main_loop(int, char**) /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/main.cpp:606:5 (qemu-system-x86_64+0x54427c)
  #21 MainLoopThread::run() /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android/android-emu/android/skin/qt/emulator-qt-window.h:73:13 (qemu-system-x86_64+0xc40b41)
  #22 QThreadPrivate::start(void*) /usr/local/google/home/joshuaduong/qt-build/src/qt-everywhere-src-5.11.1/qtbase/src/corelib/thread/qthread_unix.cpp:367:14 (libQt5Core.so.5+0xa7d35)

Thread T10 'MainLoopThread' (tid=3881, running) created by main thread at:
  #0 pthread_create /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:968:3 (qemu-system-x86_64+0x4d3cda)
  #1 QThread::start(QThread::Priority) /usr/local/google/home/joshuaduong/qt-build/src/qt-everywhere-src-5.11.1/qtbase/src/corelib/thread/qthread_unix.cpp:726:16 (libQt5Core.so.5+0xa84fb)
  #2 skin_winsys_spawn_thread /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android/android-emu/android/skin/qt/winsys-qt.cpp:519:17 (qemu-system-x86_64+0xbcf68b)
  #3 main /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/main.cpp:1624:5 (qemu-system-x86_64+0x543f9c)

Thread T13 (tid=0, running) created by thread T10 at:
  #0 on_new_fiber /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/coroutine-ucontext.c:90:25 (qemu-system-x86_64+0xb71b48)
  #1 qemu_coroutine_new /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/coroutine-ucontext.c:217 (qemu-system-x86_64+0xb71b48)
  #2 qemu_coroutine_create /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/qemu-coroutine.c:88:14 (qemu-system-x86_64+0xb70349)
  #3 blk_prw /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1245:25 (qemu-system-x86_64+0x98baca)
  #4 blk_pread /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1409:15 (qemu-system-x86_64+0x98b91a)
  #5 find_image_format /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:701:11 (qemu-system-x86_64+0x958068)
  #6 bdrv_open_inherit /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:2690 (qemu-system-x86_64+0x958068)
  #7 bdrv_open /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:2802:12 (qemu-system-x86_64+0x958d1f)
  #8 blk_new_open /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:375:10 (qemu-system-x86_64+0x98927a)
  #9 blockdev_init /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../blockdev.c:598:15 (qemu-system-x86_64+0x9c5700)
  #10 drive_new /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../blockdev.c:1092 (qemu-system-x86_64+0x9c5700)
  #11 drive_init(void*, QemuOpts*, Error**) /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/drive-share.cpp:366:28 (qemu-system-x86_64+0xb1e9a2)
  #12 qemu_opts_foreach /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/qemu-option.c:1106:14 (qemu-system-x86_64+0xb696e5)
  #13 android_drive_share_init /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/drive-share.cpp:530:9 (qemu-system-x86_64+0xb1e1e0)
  #14 main_impl /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:5244:13 (qemu-system-x86_64+0x54f811)
  #15 run_qemu_main /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:3349:21 (qemu-system-x86_64+0x547a02)
  #16 enter_qemu_main_loop(int, char**) /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/main.cpp:606:5 (qemu-system-x86_64+0x54427c)
  #17 MainLoopThread::run() /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android/android-emu/android/skin/qt/emulator-qt-window.h:73:13 (qemu-system-x86_64+0xc40b41)
  #18 QThreadPrivate::start(void*) /usr/local/google/home/joshuaduong/qt-build/src/qt-everywhere-src-5.11.1/qtbase/src/corelib/thread/qthread_unix.cpp:367:14 (libQt5Core.so.5+0xa7d35)

SUMMARY: ThreadSanitizer: data race /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/async.c:167:9 in qemu_bh_schedule

WARNING: ThreadSanitizer: data race (pid=3742)
Read of size 4 at 0x7b4400025758 by thread T14 (mutexes: write M1097):
  #0 aio_notify /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/async.c:342:14 (qemu-system-x86_64+0xb46778)
  #1 qemu_bh_schedule /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/async.c:168 (qemu-system-x86_64+0xb46778)
  #2 worker_thread /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/thread-pool.c:114:9 (qemu-system-x86_64+0xb4838c)
  #3 qemu_thread_trampoline /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/qemu-thread-posix.c:551:17 (qemu-system-x86_64+0xb4fe36)
Previous atomic write of size 4 at 0x7b4400025758 by thread T10 (mutexes: write M985):
  #0 __tsan_atomic32_fetch_add /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interface_atomic.cc:616:3 (qemu-system-x86_64+0x51df18)
  #1 aio_poll /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/aio-posix.c:608:9 (qemu-system-x86_64+0xb4c342)
  #2 blk_prw /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1247:9 (qemu-system-x86_64+0x98bbbd)
  #3 blk_pread /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1409:15 (qemu-system-x86_64+0x98b91a)
  #4 find_image_format /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:701:11 (qemu-system-x86_64+0x958068)
  #5 bdrv_open_inherit /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:2690 (qemu-system-x86_64+0x958068)
  #6 bdrv_open /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:2802:12 (qemu-system-x86_64+0x958d1f)
  #7 blk_new_open /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:375:10 (qemu-system-x86_64+0x98927a)
  #8 blockdev_init /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../blockdev.c:598:15 (qemu-system-x86_64+0x9c5700)
  #9 drive_new /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../blockdev.c:1092 (qemu-system-x86_64+0x9c5700)
  #10 drive_init(void*, QemuOpts*, Error**) /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/drive-share.cpp:366:28 (qemu-system-x86_64+0xb1e9a2)
  #11 qemu_opts_foreach /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/qemu-option.c:1106:14 (qemu-system-x86_64+0xb696e5)
  #12 android_drive_share_init /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/drive-share.cpp:530:9 (qemu-system-x86_64+0xb1e1e0)
  #13 main_impl /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:5244:13 (qemu-system-x86_64+0x54f811)
  #14 run_qemu_main /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:3349:21 (qemu-system-x86_64+0x547a02)
  #15 enter_qemu_main_loop(int, char**) /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/main.cpp:606:5 (qemu-system-x86_64+0x54427c)
  #16 MainLoopThread::run() /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android/android-emu/android/skin/qt/emulator-qt-window.h:73:13 (qemu-system-x86_64+0xc40b41)
  #17 QThreadPrivate::start(void*) /usr/local/google/home/joshuaduong/qt-build/src/qt-everywhere-src-5.11.1/qtbase/src/corelib/thread/qthread_unix.cpp:367:14 (libQt5Core.so.5+0xa7d35)

Location is heap block of size 296 at 0x7b44000256c0 allocated by thread T10:
  #0 calloc /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:684:5 (qemu-system-x86_64+0x4d26ef)
  #1 g_malloc0 /tmp/jansene-build-temp-193494/src/glib-2.38.2/glib/gmem.c:134 (qemu-system-x86_64+0xecfe18)
  #2 qemu_init_main_loop /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/main-loop.c:165:24 (qemu-system-x86_64+0xb4abac)
  #3 main_impl /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:4605:9 (qemu-system-x86_64+0x54b937)
  #4 run_qemu_main /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:3349:21 (qemu-system-x86_64+0x547a02)
  #5 enter_qemu_main_loop(int, char**) /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/main.cpp:606:5 (qemu-system-x86_64+0x54427c)
  #6 MainLoopThread::run() /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android/android-emu/android/skin/qt/emulator-qt-window.h:73:13 (qemu-system-x86_64+0xc40b41)
  #7 QThreadPrivate::start(void*) /usr/local/google/home/joshuaduong/qt-build/src/qt-everywhere-src-5.11.1/qtbase/src/corelib/thread/qthread_unix.cpp:367:14 (libQt5Core.so.5+0xa7d35)

Mutex M1097 (0x7b3800009bd0) created at:
  #0 pthread_mutex_init /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:1187:3 (qemu-system-x86_64+0x4d4ebc)
  #1 qemu_mutex_init /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/qemu-thread-posix.c:61:11 (qemu-system-x86_64+0xb4f2a7)
  #2 thread_pool_init_one /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/thread-pool.c:308:5 (qemu-system-x86_64+0xb47a7b)
  #3 thread_pool_new /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/thread-pool.c:327 (qemu-system-x86_64+0xb47a7b)
  #4 aio_get_thread_pool /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/async.c:320:28 (qemu-system-x86_64+0xb46934)
  #5 paio_submit_co /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/file-posix.c:1565:12 (qemu-system-x86_64+0xa15d23)
  #6 raw_co_prw /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/file-posix.c:1620 (qemu-system-x86_64+0xa15d23)
  #7 raw_co_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/file-posix.c:1627 (qemu-system-x86_64+0xa15d23)
  #8 bdrv_driver_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/io.c:924:16 (qemu-system-x86_64+0x8d91f7)
  #9 bdrv_aligned_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/io.c:1228 (qemu-system-x86_64+0x8d91f7)
  #10 bdrv_co_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/io.c:1324:11 (qemu-system-x86_64+0x8d8dd7)
  #11 blk_co_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1158:11 (qemu-system-x86_64+0x98b0da)
  #12 blk_read_entry /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1206:17 (qemu-system-x86_64+0x98c1e3)
  #13 coroutine_trampoline /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/coroutine-ucontext.c:173:9 (qemu-system-x86_64+0xb71c8a)
  #14 <null> <null> (libc.so.6+0x43fcf)

Mutex M985 (0x000003bb0fb8) created at:
  #0 pthread_mutex_init /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:1187:3 (qemu-system-x86_64+0x4d4ebc)
  #1 qemu_mutex_init /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/qemu-thread-posix.c:61:11 (qemu-system-x86_64+0xb4f2a7)
  #2 qemu_init_cpu_loop /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../cpus.c:1123:5 (qemu-system-x86_64+0x5ddb0c)
  #3 main_impl /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:3437:5 (qemu-system-x86_64+0x547ae0)
  #4 run_qemu_main /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:3349:21 (qemu-system-x86_64+0x547a02)
  #5 enter_qemu_main_loop(int, char**) /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/main.cpp:606:5 (qemu-system-x86_64+0x54427c)
  #6 MainLoopThread::run() /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android/android-emu/android/skin/qt/emulator-qt-window.h:73:13 (qemu-system-x86_64+0xc40b41)
  #7 QThreadPrivate::start(void*) /usr/local/google/home/joshuaduong/qt-build/src/qt-everywhere-src-5.11.1/qtbase/src/corelib/thread/qthread_unix.cpp:367:14 (libQt5Core.so.5+0xa7d35)

Thread T14 (tid=3886, running) created by thread T10 at:
  #0 pthread_create /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:968:3 (qemu-system-x86_64+0x4d3cda)
  #1 qemu_thread_create /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/qemu-thread-posix.c:591:11 (qemu-system-x86_64+0xb4fcef)
  #2 do_spawn_thread /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/thread-pool.c:135:5 (qemu-system-x86_64+0xb48072)
  #3 spawn_thread_bh_fn /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/thread-pool.c:143 (qemu-system-x86_64+0xb48072)
  #4 aio_bh_call /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/async.c:90:5 (qemu-system-x86_64+0xb46616)
  #5 aio_bh_poll /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/async.c:118 (qemu-system-x86_64+0xb46616)
  #6 aio_poll /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/aio-posix.c:706:17 (qemu-system-x86_64+0xb4cf45)
  #7 blk_prw /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1247:9 (qemu-system-x86_64+0x98bbbd)
  #8 blk_pread /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1409:15 (qemu-system-x86_64+0x98b91a)
  #9 find_image_format /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:701:11 (qemu-system-x86_64+0x958068)
  #10 bdrv_open_inherit /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:2690 (qemu-system-x86_64+0x958068)
  #11 bdrv_open /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:2802:12 (qemu-system-x86_64+0x958d1f)
  #12 blk_new_open /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:375:10 (qemu-system-x86_64+0x98927a)
  #13 blockdev_init /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../blockdev.c:598:15 (qemu-system-x86_64+0x9c5700)
  #14 drive_new /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../blockdev.c:1092 (qemu-system-x86_64+0x9c5700)
  #15 drive_init(void*, QemuOpts*, Error**) /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/drive-share.cpp:366:28 (qemu-system-x86_64+0xb1e9a2)
  #16 qemu_opts_foreach /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/qemu-option.c:1106:14 (qemu-system-x86_64+0xb696e5)
  #17 android_drive_share_init /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/drive-share.cpp:530:9 (qemu-system-x86_64+0xb1e1e0)
  #18 main_impl /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:5244:13 (qemu-system-x86_64+0x54f811)
  #19 run_qemu_main /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:3349:21 (qemu-system-x86_64+0x547a02)
  #20 enter_qemu_main_loop(int, char**) /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/main.cpp:606:5 (qemu-system-x86_64+0x54427c)
  #21 MainLoopThread::run() /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android/android-emu/android/skin/qt/emulator-qt-window.h:73:13 (qemu-system-x86_64+0xc40b41)
  #22 QThreadPrivate::start(void*) /usr/local/google/home/joshuaduong/qt-build/src/qt-everywhere-src-5.11.1/qtbase/src/corelib/thread/qthread_unix.cpp:367:14 (libQt5Core.so.5+0xa7d35)

Thread T10 'MainLoopThread' (tid=3881, running) created by main thread at:
  #0 pthread_create /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:968:3 (qemu-system-x86_64+0x4d3cda)
  #1 QThread::start(QThread::Priority) /usr/local/google/home/joshuaduong/qt-build/src/qt-everywhere-src-5.11.1/qtbase/src/corelib/thread/qthread_unix.cpp:726:16 (libQt5Core.so.5+0xa84fb)
  #2 skin_winsys_spawn_thread /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android/android-emu/android/skin/qt/winsys-qt.cpp:519:17 (qemu-system-x86_64+0xbcf68b)
  #3 main /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/main.cpp:1624:5 (qemu-system-x86_64+0x543f9c)

SUMMARY: ThreadSanitizer: data race /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/async.c:342:14 in aio_notify

yuri updated this revision to Diff 179049.Dec 20 2018, 5:41 AM
  • introduce ThreadType enum
  • cleanup internal interfaces
  • add pthread_exit() interceptor
  • more tests
yuri added a comment.EditedDec 20 2018, 7:52 AM

I don't completely follow the logic behind cur_thread/cur_thread_fast/cur_thread1 and how it does not introduce a slowdown.
cur_thread_fast still includes an indirection, so it looks exactly the same as the first version.
cur_thread_fast does not check for NULL, so I would expect it to be used only in a few carefully chosen functions. But it's used in most entry functions. Which makes me wonder: when can cur_thread_fast not be used, if it can be used in all these places?
It seems that you use cur_thread only to lazily initialize cur_thread1 in Initialize or ThreadStart. But then the question is: why don't we explicitly initialize cur_thread1 in these functions, and then just have a single cur_thread function as before, with no two subtly different accessors?

I assume that after the program or a new thread starts, the sanitizer will first see a few (or maybe zero) interceptors, then _func_entry(), and then everything else. Following this logic, I use cur_thread_fast() in all performance-critical entry points excluding _func_entry().

I ran the tsan benchmarks, and it looks like the memory indirection by itself does not affect performance because the variable is in the CPU cache in most cases. At the same time, the conditional check has a visible effect on performance.

There was a discussion of benchmark results here. I removed it because it looks like the times randomly float around 3-4%, which is more than the impact of the patch.

yuri marked 13 inline comments as done.Dec 20 2018, 8:05 AM
yuri added inline comments.
lib/tsan/rtl/tsan_rtl.h
886

We can't use the enum type for the parameter because the value is a combination of bits. This is done the same way as enum MutexFlags in tsan_sync.h.
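The pattern being referred to (the same style as MutexFlags) looks roughly like this; the names are generic, not the patch's:

// The enum names the individual bits, but a combination of bits is not a
// valid enumerator, so the parameter has to stay a plain integer.
enum FiberSwitchFlags {
  FiberSwitchFlagNoSync = 1 << 0,
  FiberSwitchFlagOther  = 1 << 1,  // hypothetical second bit
};

void FiberSwitchImpl(unsigned flags);  // takes unsigned, not FiberSwitchFlags

// FiberSwitchFlagNoSync | FiberSwitchFlagOther has no matching enumerator:
// FiberSwitchImpl(FiberSwitchFlagNoSync | FiberSwitchFlagOther);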

yuri marked an inline comment as done.Dec 20 2018, 8:05 AM
yuri added a comment.Dec 20 2018, 8:15 AM

I added default synchronization and a flag to opt out.

Can you switch to it in your codebase? I would expect that you need all (or almost all) of this synchronization anyway?

Unfortunately, I can't check it. The current implementation in our codebase works fine with synchronization on fiber switch.

I plan to implement another mode where fibers are not synchronized by default, but only when they call special synchronization APIs (events, mutexes, etc.). This way I can catch more errors in code where fibers run on the same thread only by chance. In order to implement it, I need some way to opt out of synchronization in tsan.

yuri updated this revision to Diff 179085.Dec 20 2018, 8:26 AM

Fix typo

I added default synchronization and a flag to opt out.

Can you switch to it in your codebase? I would expect that you need all (or almost all) of this synchronization anyway?

Unfortunately, I can't check it. The current implementation in our codebase works fine with synchronization on fiber switch.

I plan to implement another mode where fibers are not synchronized by default, but only when they call special synchronization APIs (events, mutexes, etc.). This way I can catch more errors in code where fibers run on the same thread only by chance. In order to implement it, I need some way to opt out of synchronization in tsan.

Sorry, I meant: switch to the built-in synchronization to validate that it works in real projects and either removes the need for manual synchronization annotations or at least significantly reduces it.

lib/tsan/rtl/tsan_rtl.h
886

Then I guess I am out of arguments because I wrote that code :)

Re performance, we have this open performance regression in clang codegen which makes it harder to do analysis:
https://bugs.llvm.org/show_bug.cgi?id=39748
https://reviews.llvm.org/D54821

I've benchmarked on 350140 with host gcc version 7.3.0 (Debian 7.3.0-5), running the old and new binaries alternately:

int main() {
  const int kSize = 2<<10;
  const int kRepeat = 1<<19;
  volatile long data[kSize];
  for (int i = 0; i < kRepeat; i++) {
    for (int j = 0; j < kSize; j++)
      data[j] = 1;
    __atomic_load_n(&data[0], __ATOMIC_ACQUIRE);
    __atomic_store_n(&data[0], 1, __ATOMIC_RELEASE);
  }
}

compiler-rt$ TIME="%e" nice -20 time taskset -c 0 ./current.test
8.68
8.71
8.63
8.67
8.73
8.63

compiler-rt$ TIME="%e" nice -20 time taskset -c 0 ./fiber.test
9.97
9.85
9.99
9.91
9.88
9.92
9.89
9.88

This looks like a 14% degradation.

yuri updated this revision to Diff 180547.Jan 7 2019, 12:46 PM

Improved performance

yuri added a comment.Jan 7 2019, 12:58 PM

I improved test execution time. On my system I got the following execution times (compared to the original version of the code):

compiler    current    fibers
gcc 8.2     100.0%     98.6%
clang       112.4%     100.7%

I improved test execution time. On my system I got the following execution times (compared to the original version of the code):

compiler    current    fibers
gcc 8.2     100.0%     98.6%
clang       112.4%     100.7%

"current" is compiler-rt HEAD with no local changes, or with the previous version of this change? This makes clang-compiled runtime 12% faster?
Were you able to reproduce the slowdown in https://reviews.llvm.org/D54889#1343582 with the previous version of the fibers change? What slowdown did you get?

I improved the test execution time. On my system I got the following execution times (compared to the original version of the code):

compiler | current | fibers
gcc 8.2  | 100.0%  | 98.6%
clang    | 112.4%  | 100.7%

"current" is compiler-rt HEAD with no local changes, or with the previous version of this change? This makes clang-compiled runtime 12% faster?
Were you able to reproduce the slowdown in https://reviews.llvm.org/D54889#1343582 with the previous version of the fibers change? What slowdown did you get?

What exactly has changed wrt performance? I see some small tweaks, but I am not sure whether they are the main reason behind the speedup or whether I am missing something important.

yuri added a comment.Jan 8 2019, 2:23 AM

"current" is compiler-rt HEAD with no local changes, or with the previous version of this change? This makes clang-compiled runtime 12% faster?

"current" is compiler-rt HEAD without any changes.
"clang" is LLVM and clang HEAD.

In the case of the clang-compiled library, even the previous version of the patch was faster than HEAD.

It looks like the additional ThreadState indirection by itself introduces minimal overhead, but affects code generation in unpredictable ways, especially for clang.

Were you able to reproduce the slowdown in https://reviews.llvm.org/D54889#1343582 with the previous version of the fibers change? What slowdown did you get?

Yes, with the previous version of the patch and the gcc-compiled library I see a slowdown of around 3.5%.

What exactly has changed wrt performance? I see some small tweaks, but I am not sure whether they are the main reason behind the speedup or whether I am missing something important.

You may consider this "cheating" because the changes are general and not related to fibers.

Most of the speedup comes from a change in tsan_rtl.cc:

- if (!SANITIZER_GO && *shadow_mem == kShadowRodata) {
+ if (!SANITIZER_GO && !kAccessIsWrite && *shadow_mem == kShadowRodata) {

Placing LIKELY/UNLIKELY annotations in the code gives an additional 1-2%.
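
For reference, these macros in sanitizer_common are thin wrappers over __builtin_expect; exactly which branches were annotated in this patch is not reproduced here, the snippet below only illustrates the mechanism (the placement on the kShadowRodata check is an assumption):

// Definitions as in sanitizer_internal_defs.h.
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

// Illustrative (not verbatim) use on the fast path discussed above: the
// .rodata shadow value is the rare case, so the branch is hinted as not taken.
// if (!SANITIZER_GO && !kAccessIsWrite && UNLIKELY(*shadow_mem == kShadowRodata))
//   return;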

Benchmark results with clang:

write8:
HEAD: 9.74
fibers(old): 9.40
fibers(new): 8.66

read8:
HEAD: 9.92
fibers(new): 9.26

read4:
HEAD: 10.20
fibers(new): 9.36

fibers(new) is the current version of the change. fibers(old) is the one I used with gcc (no spot optimizations).

So this change indeed makes it faster for clang, but the fastest clang version is still slower than gcc, which suggests that there is something to improve in clang codegen. If we improve clang codegen, will this change again lead to slower code or not? It's bad that we have that open clang regression...

yuri updated this revision to Diff 181253.Jan 11 2019, 5:19 AM

Optimized MemoryAccessImpl1()

yuri added a comment.Jan 11 2019, 5:23 AM

I have found out which optimization is applied by gcc but not by clang and applied it manually. Execution times for the new version:

compiler | HEAD    | fibers
gcc 8.2  | 100.0%  | 94.4%
clang    | 112.4%  | 94.1%
yuri added a comment.Jan 21 2019, 8:01 AM

Hi Dmitry,
Now that all performance issues are resolved, what else should be done to complete the review?

Now that all performance issues are resolved, what else should be done to complete the review?

Just me finding time to review it.
I need to put together all the numbers. Is it faster than it was (that is, do the new optimizations outweigh the change to cur_thread)? Or does the change to cur_thread make it faster by itself? I am somewhat concerned by the "cheating"; I understand that it's you who found these potential additional optimizations, but they are still unrelated to the cur_thread change. Thousands of existing tsan users don't use fibers at all today.
Thanks for your persistence btw.

Spent a while benchmarking this. I used 4 variations of a test program:
First just does read or write access of the given size:
https://gist.githubusercontent.com/dvyukov/9155651ecb9434368f06a30df2abcb20/raw/c0756a9d643718faaacda2c78ace1cb95f05af87/gistfile1.txt
Second, prepopulates shadow with some other accesses first:
https://gist.githubusercontent.com/dvyukov/12b4ca0a15ffccccdaf9e92d20df40e2/raw/a2cdfc65fccf1eecdc232c876dc45e41229f6ce1/gistfile1.txt
Next has 2 threads that access the range in round-robin, both threads do writes (or reads):
https://gist.githubusercontent.com/dvyukov/0d84c9d9795cbbd8b9b9bb557d1e2f36/raw/24dcdf48f39cd12ac77ae71e9de7f8ee01c17df2/gistfile1.txt
The last one has 1 thread that does writes, followed by 2 threads that do reads:
https://gist.githubusercontent.com/dvyukov/1efa2c0335d9d944b33857dbd70cb8ce/raw/6eff5ea186fd5d8540090faaf344ad66250c6e5e/gistfile1.txt

I used 3 variations of the runtime: the current tsan runtime without modifications, this change, and this change with cur_thread/cur_thread_fast returning reinterpret_cast<ThreadState *>(cur_thread_placeholder). In all cases I used a self-hosted clang build at revision 353000.
Each test was run 3 times and the fastest time was taken.
The current code was the slowest in all cases, so I removed it from the results. Here is the slowdown in % for fiber support compared to returning cur_thread_placeholder:

Fibers seem to incur ~4% slowdown on average.

Since fiber support incurs a slowdown for all current tsan users who don't use fibers, this is a hard decision.

I've prototyped a change that leaves fast_state and fast_synch_epoch in TLS (the only state accessed on the fast path besides the clock):

struct FastThreadState {
  FastState fast_state;
  u64 fast_synch_epoch;
};

__attribute__((tls_model("initial-exec")))
extern THREADLOCAL char cur_thread_faststate1[];

INLINE FastThreadState& cur_thread_faststate() {
  return *reinterpret_cast<FastThreadState *>(cur_thread_faststate1);
}

But this seems to be even slower than just using the single pointer indirection.

Also benchmarked function entry/exit using the following benchmark:

// foo1.c
void foo(bool x);
void bar() {}
int main() {
  volatile int kRepeat = 1 << 30;
  const int repeat = kRepeat;
  for (int i = 0; i < repeat; i++)
    foo(false);
}

// foo2.c
void bar();
void foo(bool x) {
  if (x)
    bar();
}

The program spends ~75% of its time in __tsan_func_entry/exit. The rest of the conditions are the same as in the previous benchmark.

Current code runs in 7.16s
This change -- 6.23s
This change + cur_thread_fast returning cur_thread_placeholder -- 7.01s
I also tried this change + FuncEntry using cur_thread_fast -- 6.20s

Using the pointer indirection seems to positively affect func entry/exit codegen.

Current code runs in 7.16s
This change -- 6.23s
This change + cur_thread_fast returning cur_thread_placeholder -- 7.01s
I also tried this change + FuncEntry using cur_thread_fast -- 6.20s

But if I do:

INLINE ThreadState *cur_thread_fast() {
  ThreadState* thr;
  __asm__("": "=a"(thr): "a"(&cur_thread_placeholder[0]));
  return thr;
}

(which is a dirty trick to force the compiler to cache the address of the TLS object in a register), then the program runs in 5.94s -- faster than any other option, as it takes advantage of both no indirection and faster instructions.

But this is not beneficial for __tsan_read/write functions because caching the address takes a register and these functions are already severely short on registers.

Going forward I think we should get all the unrelated/preparatory changes in first: the thread type (creates lots of diffs), the pthread_exit interceptor/test, and the spot optimizations to the memory access functions.

yuri added a comment.EditedFeb 6 2019, 12:45 PM

Going forward I think we should get all the unrelated/preparatory changes in first: the thread type (creates lots of diffs), the pthread_exit interceptor/test, and the spot optimizations to the memory access functions.

Great to hear it!

I have uploaded first part (thread type) at D57839.

yuri added a comment.Feb 6 2019, 11:50 PM

One more patch (pthread_exit): D57876

yuri added a comment.Feb 7 2019, 3:17 AM

Next patch (performance optimizations): D57882

yuri updated this revision to Diff 185745.Feb 7 2019, 4:57 AM

Rebased on current trunk

yuri marked an inline comment as done.Feb 7 2019, 4:58 AM

Current code runs in 7.16s
This change -- 6.23s
This change + cur_thread_fast returning cur_thread_placeholder -- 7.01s
I also tried this change + FuncEntry using cur_thread_fast -- 6.20s

But if I do:

INLINE ThreadState *cur_thread_fast() {
  ThreadState* thr;
  __asm__("": "=a"(thr): "a"(&cur_thread_placeholder[0]));
  return thr;
}

(which is a dirty trick to force the compiler to cache the address of the TLS object in a register), then the program runs in 5.94s -- faster than any other option, as it takes advantage of both no indirection and faster instructions.

This looked like an interesting optimization, but it turns out to be too sensitive to unrelated code changes.
With a slightly changed benchmark:
http://llvm.org/viewvc/llvm-project?view=revision&revision=353407
This change shows a ~2% slowdown instead: from 6s to 6.15s.

Did another round of benchmarking of this change on the current HEAD using these 2 benchmarks:
https://github.com/llvm-mirror/compiler-rt/blob/master/lib/tsan/benchmarks/mop.cc
https://github.com/llvm-mirror/compiler-rt/blob/master/lib/tsan/benchmarks/func_entry_exit.cc

Here fibers is this change, and fibers* is this change with 2 additional changes:

  1. cur_thread_fast in __tsan_func_entry.
  2. cur_thread1 embedded in ThreadState. The reason for this is that cur_thread1 and ThreadState are currently separate and in different cache lines (cur_thread is allocated with the ALIGN(64) attribute and all hot members are at the beginning of the struct). So what I did is:

    struct ThreadState {
      FastState fast_state;
      u64 fast_synch_epoch;
      ThreadState* current;  // <----- ADD THIS
      ...
    };

    INLINE ThreadState *cur_thread_fast() {
      // return cur_thread1;
      return reinterpret_cast<ThreadState *>(cur_thread_placeholder)->current;
    }

Now a ~2% slowdown on highly synthetic benchmarks looks like something we can tolerate (2 cases are actually faster).

The cur_thread/cur_thread_fast separation still looks confusing to me. It's a convoluted way to do lazy initialization. If one adds any new calls to these functions, which one to choose is non-obvious. I think we should do lazy initialization explicitly. Namely, leave cur_thread alone, don't introduce cur_thread_fast, don't change any call sites. Instead, add an init_cur_thread call that does lazy initialization at interceptor entry points and at any other points that we expect can be the first call into the tsan runtime overall or within a new thread. I think interceptors and tsan_init should be enough (no tsan_func_entry). We call __tsan_init from .preinit_array; instrumented code can't be executed before .preinit_array, and only interceptors from the dynamic loader can precede .preinit_array callbacks.

With these 3 changes, it looks good to me and I am ready to merge it.

To clarify the graph: it's the difference in execution time in percent compared to the current HEAD, i.e. -4 means that fibers are 4% slower than the current HEAD.
1w means the mop.cc benchmark with 1-byte write accesses; 4r means 4-byte read accesses; ee is the func_entry_exit.cc benchmark.

yuri updated this revision to Diff 185949.Feb 8 2019, 5:23 AM
  • Replaced cur_thread1 with ThreadState::current
  • Removed cur_thread_fast() and replaced calls with cur_thread()
  • Added cur_thread_init()
yuri added a comment.EditedFeb 8 2019, 5:28 AM

The cur_thread/cur_thread_fast separation still looks confusing to me. It's a convoluted way to do lazy initialization. If one adds any new calls to these functions, which one to choose is non-obvious. I think we should do lazy initialization explicitly. Namely, leave cur_thread alone, don't introduce cur_thread_fast, don't change any call sites. Instead, add an init_cur_thread call that does lazy initialization at interceptor entry points and at any other points that we expect can be the first call into the tsan runtime overall or within a new thread. I think interceptors and tsan_init should be enough (no tsan_func_entry). We call __tsan_init from .preinit_array; instrumented code can't be executed before .preinit_array, and only interceptors from the dynamic loader can precede .preinit_array callbacks.

For now I added calls to cur_thread_init() in 3 places. That was enough to pass all tests on my system. I am not sure whether it will work with different versions of glibc. What do you think?
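
For readers following along, a minimal sketch of what such lazy initialization can look like under this design (assumed shape, not a quote of the uploaded diff): the static TLS placeholder always exists, the first entry into the runtime on a thread points current at it, and a fiber switch can later repoint it.

// Assumed sketch; the real definitions are in tsan_rtl.h of this change.
INLINE ThreadState *cur_thread() {
  return reinterpret_cast<ThreadState *>(cur_thread_placeholder)->current;
}

INLINE void cur_thread_init() {
  ThreadState *thr = reinterpret_cast<ThreadState *>(cur_thread_placeholder);
  if (UNLIKELY(!thr->current))
    thr->current = thr;  // first runtime entry on this thread
}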

yuri updated this revision to Diff 186145.Feb 10 2019, 2:07 AM

Rebased on current trunk

The change is now in very good shape.

For now I added calls to cur_thread_init() in 3 places. That was enough to pass all tests on my system. I am not sure whether it will work with different versions of glibc. What do you think?

I've tested on 2 more distributions and it worked for me.
However, for thread initialization we rely on the fact that pthread's thread entry function (start_thread) calls _setjmp, which we intercept. If start_thread does not call _setjmp, then we risk hitting a NULL cur_thread in secondary threads. I think we need to add cur_thread_init also to:

  1. SCOPED_INTERCEPTOR_RAW (this will ensure that if start_thread calls any intercepted libc function, we will init at that point)
  2. tsan_thread_start_func (an analog of tsan_init for secondary threads; this function is not instrumented, so even if start_thread does not call any intercepted functions, we can still initialize at that point without slowing down __tsan_func_entry)
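
A hedged sketch of item 1 above (the macro body is abbreviated and assumed, not copied from the tree): the only addition is a single statement at the very top, before the thread state is fetched.

// Abbreviated, assumed shape of the interceptor prologue macro; the real one
// in tsan_interceptors.cc carries more bookkeeping.
#define SCOPED_INTERCEPTOR_RAW(func, ...)       \
  cur_thread_init();  /* item 1 above */        \
  ThreadState *thr = cur_thread();              \
  ScopedInterceptor si(thr, #func, GET_CALLER_PC())
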
yuri added a comment.Feb 11 2019, 4:10 AM

However, for thread initialization we rely on the fact that pthread's thread entry function (start_thread) calls _setjmp, which we intercept. If start_thread does not call _setjmp, then we risk hitting a NULL cur_thread in secondary threads. I think we need to add cur_thread_init also to:

  1. SCOPED_INTERCEPTOR_RAW (this will ensure that if start_thread calls any intercepted libc function, we will init at that point)

There are a lot of interceptors that do

if (cur_thread()->in_symbolizer)

before SCOPED_INTERCEPTOR_RAW. What to do with them?

However, for thread initialization we rely on the fact that pthread's thread entry function (start_thread) calls _setjmp, which we intercept. If start_thread does not call _setjmp, then we risk hitting a NULL cur_thread in secondary threads. I think we need to add cur_thread_init also to:

  1. SCOPED_INTERCEPTOR_RAW (this will ensure that if start_thread calls any intercepted libc function, we will init at that point)

There are a lot of interceptors that do

if (cur_thread()->in_symbolizer)

before SCOPED_INTERCEPTOR_RAW. What to do with them?

Yikes! Good question!

If we are in the symbolizer, we've already initialized cur_thread, since we are coming recursively from the runtime. But this does not help, because if we are not in the symbolizer, cur_thread can still be uninitialized...

We have it in malloc, atexit, and similar fundamental functions that may well be called during process or thread start.

All of the in_symbolizer checks call cur_thread in the same expression rather than using a local variable, i.e. they are of the form:

if (cur_thread()->in_symbolizer)

which suggests that we should introduce a helper in_symbolizer(void) function that encapsulates cur_thread_init and the check (it probably should go into tsan_interceptors.h).
Please send a separate change that adds such a helper function; then this change will just add cur_thread_init to that function.
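
A sketch of the requested helper (assumed shape; the actual addition is the separate change referenced just below):

// Assumed sketch of the helper, intended for tsan_interceptors.h; call sites
// then change from `if (cur_thread()->in_symbolizer)` to `if (in_symbolizer())`.
INLINE bool in_symbolizer() {
  cur_thread_init();  // the lazy init this change needs at these entry points
  return UNLIKELY(cur_thread()->in_symbolizer);
}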

yuri added a comment.Feb 11 2019, 11:37 PM

Patch with in_symbolizer() function: D58104

yuri updated this revision to Diff 186423.Feb 12 2019, 2:00 AM

Add more calls of cur_thread_init()

dvyukov accepted this revision.Feb 12 2019, 2:12 AM
This revision is now accepted and ready to land.Feb 12 2019, 2:12 AM
dvyukov closed this revision.Feb 12 2019, 2:14 AM

Committed in:
http://llvm.org/viewvc/llvm-project?view=revision&revision=353817

Thanks for bearing with me; this touched very sensitive parts of the runtime, so I did not want to rush. This adds a useful capability to ThreadSanitizer. The performance improvements that resulted from this work are much appreciated too.

yuri added a comment.Feb 12 2019, 2:33 AM

Final patch (macOS support): D58110

yuri added a subscriber: lei.Feb 12 2019, 5:07 AM
yuri added a comment.Feb 12 2019, 5:10 AM

@lei, please help to investigate what happened in http://lab.llvm.org:8011/builders/sanitizer-ppc64be-linux/builds/11413

Is it possible to get a stack trace of the crash?

rovka added a subscriber: rovka.Feb 13 2019, 3:21 AM

Hi Yuri,

I think this might be breaking our aarch64-full buildbot [1]. I ran compiler-rt/test/sanitizer_common/tsan-aarch64-Linux/Linux/Output/clock_gettime.c.tmp and got this stack trace [2]:

Any ideas?

The bot has been red for quite a while now, so I think I will have to revert this while we investigate.

[1] http://lab.llvm.org:8011/builders/clang-cmake-aarch64-full/builds/6556
[2] https://pastebin.com/1cr5hMD2

yuri added a comment.Feb 13 2019, 3:46 AM

Hi Yuri,

I think this might be breaking our aarch64-full buildbot [1]. I ran compiler-rt/test/sanitizer_common/tsan-aarch64-Linux/Linux/Output/clock_gettime.c.tmp and got this stack trace [2]:

Any ideas?

Please see fix in D58171

Resubmitted in 353947 with 2 fixes squashed.

---------- Forwarded message ---------

From: Yi-Hong Lyu via Phabricator <reviews@reviews.llvm.org>
Date: Wed, Feb 13, 2019 at 5:33 PM
Subject: [Diffusion] rL353817: tsan: add fiber support

Yi-Hong.Lyu added a comment.
This patch seems to cause a failure on buildbot http://lab.llvm.org:8011/builders/clang-ppc64le-linux-multistage/builds/9150

I see a similar failure was already reported:
https://reviews.llvm.org/D54889#1394632
There was no fix for it, right? Or did I forget to squash something else?

yuri added a comment.Feb 13 2019, 8:47 AM

Looks like it is already fixed. See http://lab.llvm.org:8011/builders/clang-ppc64le-linux-multistage/builds/9157

yuri added a comment.Feb 13 2019, 8:51 AM

I see a similar failure was already reported:
https://reviews.llvm.org/D54889#1394632
There was no fix for it, right? Or did I forget to squash something else?

It is also fixed: http://lab.llvm.org:8011/builders/sanitizer-ppc64be-linux/builds/11427