This is an archive of the discontinued LLVM Phabricator instance.

Fiber support for thread sanitizer
ClosedPublic

Authored by yuri on Nov 26 2018, 2:07 AM.

Details

Reviewers
dvyukov
blastrock
Summary

This patch adds functions for managing fibers:

  • __tsan_get_current_fiber()
  • __tsan_create_fiber()
  • __tsan_destroy_fiber()
  • __tsan_switch_to_fiber()
  • __tsan_set_fiber_name()

Diff Detail

Event Timeline

yuri created this revision.Nov 26 2018, 2:07 AM
Herald added subscribers: Restricted Project, llvm-commits, jfb and 2 others. · View Herald TranscriptNov 26 2018, 2:07 AM

Can you explain what this patch does a bit? The way I see it, it's just adding an extra level of indirection to each cur_thread() call. Is that it? Does that affect the performance of TSan?

Can we have some tests added that show what exactly TSan can now achieve with fibers?

If the approach gets approved by @dvyukov, I'd like to make sure that the work can be re-used to implement proper understanding of Darwin GCD queues.

It's unclear how this is supposed to be used. I would expect to see an interceptor for swapcontext that will make use of this, but there is none...
If these annotations are supposed to be user callable (are they? why?), we also need interface declarations in:
https://github.com/llvm-mirror/compiler-rt/blob/master/include/sanitizer/tsan_interface.h
We also need tests (lots) that demonstrate how this can be used and that this works.

I'd like to make sure that the work can be re-used to implement proper understanding of Darwin GCD queues.

What kind of API is required for GCD queues?
A while ago I sketched something for tasking support:
https://codereview.appspot.com/6904068/diff/6001/rtl/tsan_interface_task.h
It would be nice to have tasking support too (nobody programs in threads today), but I don't yet know how close this fiber support is to tasking support.

yuri added a comment.Nov 27 2018, 5:07 AM

Can you explain what this patch does a bit? The way I see it, it's just adding an extra level of indirection to each cur_thread() call. Is that it? Does that affect the performance of TSan?

This patch adds an indirection for cur_thread() on Linux; there is no way to implement the feature without it. On macOS the indirection already exists, so nothing has changed.

While preparing this patch I did my best not to affect the performance of programs that do not need the new API.

yuri added a comment.Nov 27 2018, 5:21 AM

It's unclear how this is supposed to be used. I would expect to see an interceptor for swapcontext that will make use of this, but there is none...

An interceptor for swapcontext without an additional sanitizer API is not enough because it is not known when a context is created and destroyed. Actually, I am not sure whether swapcontext is used by anyone at all because it is slow and has strange limitations.

My interface assumes that users call __tsan_switch_to_fiber() immediately before switching contexts and then call swapcontext or whatever they want to perform the actual switch.
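
A minimal sketch of that ordering with ucontext (the coroutine type and helper are hypothetical; only the __tsan_* calls come from this patch, using the flags argument a later revision adds):

  #include <sanitizer/tsan_interface.h>
  #include <ucontext.h>

  // Hypothetical coroutine handle; only the __tsan_* calls are from the patch.
  typedef struct {
    ucontext_t ctx;
    void *tsan_fiber;  // opaque handle returned by __tsan_create_fiber()
  } coro_t;

  // Switch from the current context to co, telling tsan immediately before
  // the real context switch.
  static void coro_switch(ucontext_t *from, coro_t *co) {
    __tsan_switch_to_fiber(co->tsan_fiber, 0);
    swapcontext(from, &co->ctx);
  }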

If these annotations are supposed to be user callable (are they? why?), we also need interface declarations in:
https://github.com/llvm-mirror/compiler-rt/blob/master/include/sanitizer/tsan_interface.h

Interface change is already present in diff.

We also need tests (lots) that demonstrate how this can be used and that this works.

OK, I will work on it and add some tests.

It's unclear how this is supposed to be used. I would expect to see an interceptor for swapcontext that will make use of this, but there is none...

An interceptor for swapcontext without an additional sanitizer API is not enough because it is not known when a context is created and destroyed. Actually, I am not sure whether swapcontext is used by anyone at all because it is slow and has strange limitations.

My interface assumes that users call __tsan_switch_to_fiber() immediately before switching contexts and then call swapcontext or whatever they want to perform the actual switch.

What do people use nowadays?

Will swapcontext work along with makecontext interceptor?
swapcontext is still used, and it's also a good way to show how the API should be used and to write tests.

If these annotations are supposed to be user callable (are they? why?), we also need interface declarations in:
https://github.com/llvm-mirror/compiler-rt/blob/master/include/sanitizer/tsan_interface.h

Interface change is already present in diff.

Missed it.
But please also add comments there. Perhaps I did not expect that the interface declarations could take so few lines when I scanned the change.

We also need tests (lots) that demonstrate how this can be used and that this works.

OK, I will work on it and add some tests.

yuri added a comment.Nov 27 2018, 5:51 AM

I'd like to make sure that the work can be re-used to implement proper understanding of Darwin GCD queues.

What kind of API is required for GCD queues?
A while ago I sketched something for tasking support:
https://codereview.appspot.com/6904068/diff/6001/rtl/tsan_interface_task.h
It would be nice to have tasking support too (nobody programs in threads today), but I don't yet know how close this fiber support is to tasking support.

The proposed fiber API is well suited for GCD tasks:

  1. When a task is submitted to a concurrent queue, call __tsan_create_fiber() and wrap the task with a custom task callback
  2. When the callback gets called:
    • Call __tsan_get_current_fiber() and save the result
    • Call __tsan_switch_to_fiber() for the fiber created in the first step
    • Execute the actual task
    • Call __tsan_switch_to_fiber() with the saved fiber
    • Call __tsan_destroy_fiber()

I will add a test that covers this scenario; a rough sketch of the wrapping is below.
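
A rough sketch of that wrapping (the task type and queue helper are hypothetical; only the __tsan_* calls come from the proposed API, using the flags argument a later revision adds):

  #include <sanitizer/tsan_interface.h>

  // Hypothetical task representation for a concurrent queue.
  typedef struct {
    void (*fn)(void *);
    void *arg;
    void *tsan_fiber;  // created when the task is enqueued
  } task_t;

  void push_to_queue(task_t *t);  // hypothetical queue operation

  void enqueue_task(task_t *t) {            // step 1
    t->tsan_fiber = __tsan_create_fiber(0);
    push_to_queue(t);
  }

  void run_task_callback(task_t *t) {       // step 2: the wrapping callback
    void *worker_fiber = __tsan_get_current_fiber();
    __tsan_switch_to_fiber(t->tsan_fiber, 0);
    t->fn(t->arg);                          // execute the actual task
    __tsan_switch_to_fiber(worker_fiber, 0);
    __tsan_destroy_fiber(t->tsan_fiber);
  }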

yuri added a comment.Nov 27 2018, 6:32 AM

It's unclear how this is supposed to be used. I would expect to see an interceptor for swapcontext that will make use of this, but there is none...

An interceptor for swapcontext without an additional sanitizer API is not enough because it is not known when a context is created and destroyed. Actually, I am not sure whether swapcontext is used by anyone at all because it is slow and has strange limitations.

My interface assumes that users call __tsan_switch_to_fiber() immediately before switching contexts and then call swapcontext or whatever they want to perform the actual switch.

What do people use nowadays?

We use custom assembly code. You can see it at https://github.com/acronis/pcs-io/blob/master/libpcs_io/pcs_ucontext.c
Previously we used sigsetjmp()/siglongjmp().
QEMU uses sigsetjmp()/siglongjmp().
Boost uses swapcontext().

Will swapcontext work along with makecontext interceptor?
swapcontext is still used, and it's also a good way to show how the API should be used and to write tests.

It is possible to implement interceptors for swapcontext() and makecontext(), but the user still has to call something when a context is destroyed.

Actually, such interceptors are not needed by everyone. If an application executes fibers in a single thread, then it does not need any special support.
The fiber API is only useful when a fiber can be dispatched on any thread of a thread pool (like in Go).

I think single-threaded programs would also benefit from swapcontext support. At the very least it will give correct stacks. I think the tsan shadow stack can also overflow with some fiber patterns (if a fiber has an uneven number of function entries and exits).

I also wonder if it's theoretically possible to support fibers done with longjmp. I did fibers in Relacy with makecontext+swapcontext to do an initial switch to a new fiber, and then setjmp/longjmp to switch between already running fibers. If that's the common pattern, then we could see in longjmp that we are actually switching to a different stack and do fiber switch.
No need to do this initially, I am just wondering if we can support more real programs without annotations (qemu developers also hit unsupported fibers in tsan).

Re cleanup, we could do it the same way we clean up atomic variables. Each atomic variable holds a large context object in tsan, but there are no explicit destructors for atomic variables (and it probably would not be feasible to ask users to annotate each atomic var). So we should have most of the required machinery already.

yuri updated this revision to Diff 176359.Dec 3 2018, 4:13 AM
  • Excluded macOS support in order to reduce patch size
  • Added comment in tsan_interface.h
  • Added tests
yuri added a comment.Dec 3 2018, 4:23 AM

I think single-threaded programs would also benefit from swapcontext support. At the very least it will give correct stacks. I think the tsan shadow stack can also overflow with some fiber patterns (if a fiber has an uneven number of function entries and exits).

Got it.

I also wonder if it's theoretically possible to support fibers done with longjmp. I did fibers in Relacy with makecontext+swapcontext to do an initial switch to a new fiber, and then setjmp/longjmp to switch between already running fibers. If that's the common pattern, then we could see in longjmp that we are actually switching to a different stack and do fiber switch.
No need to do this initially, I am just wondering if we can support more real programs without annotations (qemu developers also hit unsupported fibers in tsan).

Maybe disabling interceptors before setjmp/longjmp and enabling them afterwards will help?

Re cleanup, we could do it the same way we clean up atomic variables. Each atomic variable holds a large context object in tsan, but there are no explicit destructors for atomic variables (and it probably would not be feasible to ask users to annotate each atomic var). So we should have most of the required machinery already.

Are you sure it is a good solution?

  • A fiber context is much larger than the context of an atomic
  • It is not possible to correctly recreate a fiber context with its shadow stack once it is lost

QEMU already has annotations for AddressSanitizer at fiber switch points. It is not a big problem to add ThreadSanitizer annotations at the same places.

Some high-level comments:

  1. Do we want to synchronize fibers on switch? If fibers use TLS data, or cooperative scheduling (e.g. mutex unlock directly schedules the next critical section owner), or pass some data from prev fiber to next fiber, in all these cases not synchronizing fibers on switch will lead to false positives. As far as I remember in my old experiments I concluded that it's almost impossible to get rid of false positives if tasks/fibers are not synchronized on switch.
  2. This introduces slowdown into the hottest runtime functions (branch and indirection in cur_thread).

I think check_analyze.sh test will fail. Kinda hard to say, because it's broken already at the moment, but this perturbs runtime functions enough:

Before:

  write1 tot 367; size 1425; rsp 8; push 1; pop 5; call 2; load 21; store 13; sh  40; mov  94; lea   4; cmp  39
  write2 tot 373; size 1489; rsp 8; push 1; pop 5; call 2; load 21; store 13; sh  40; mov  95; lea   4; cmp  43
  write4 tot 374; size 1457; rsp 8; push 1; pop 5; call 2; load 21; store 13; sh  40; mov  95; lea   4; cmp  43
  write8 tot 364; size 1429; rsp 8; push 1; pop 5; call 2; load 21; store 13; sh  40; mov  95; lea   4; cmp  39
   read1 tot 409; size 1674; rsp 8; push 1; pop 4; call 2; load 21; store 13; sh  45; mov  97; lea   4; cmp  39
   read2 tot 412; size 1694; rsp 8; push 1; pop 4; call 2; load 21; store 13; sh  45; mov  97; lea   4; cmp  43
   read4 tot 412; size 1694; rsp 8; push 1; pop 4; call 2; load 21; store 13; sh  45; mov  97; lea   4; cmp  43
   read8 tot 404; size 1660; rsp 8; push 1; pop 4; call 2; load 21; store 13; sh  45; mov  97; lea   4; cmp  39
func_entry tot  42; size  165; rsp 0; push 0; pop 0; call 1; load  4; store  2; sh   3; mov  12; lea   1; cmp   1
 func_exit tot  37; size  149; rsp 0; push 0; pop 0; call 1; load  2; store  1; sh   3; mov   9; lea   1; cmp   1

After:

  write1 tot 367; size 1439; rsp 4; push 1; pop 5; call 2; load 21; store 14; sh  40; mov  93; lea   4; cmp  40
  write2 tot 375; size 1493; rsp 4; push 1; pop 5; call 2; load 21; store 14; sh  40; mov  94; lea   4; cmp  44
  write4 tot 375; size 1471; rsp 4; push 1; pop 5; call 2; load 21; store 14; sh  40; mov  94; lea   4; cmp  44
  write8 tot 365; size 1435; rsp 4; push 1; pop 5; call 2; load 21; store 14; sh  40; mov  94; lea   4; cmp  40
   read1 tot 412; size 1682; rsp 4; push 1; pop 4; call 2; load 21; store 14; sh  45; mov  96; lea   4; cmp  40
   read2 tot 414; size 1698; rsp 4; push 1; pop 4; call 2; load 21; store 14; sh  45; mov  96; lea   4; cmp  44
   read4 tot 414; size 1698; rsp 4; push 1; pop 4; call 2; load 21; store 14; sh  45; mov  96; lea   4; cmp  44
   read8 tot 404; size 1642; rsp 4; push 1; pop 4; call 2; load 21; store 14; sh  45; mov  96; lea   4; cmp  40
func_entry tot  48; size  197; rsp 0; push 0; pop 0; call 1; load  4; store  3; sh   3; mov  14; lea   1; cmp   2
 func_exit tot  45; size  181; rsp 0; push 0; pop 0; call 1; load  2; store  2; sh   3; mov  11; lea   1; cmp   2
  3. I think instead of __tsan_get_current_fiber we could use pointer 0. It prevents things like switching to context of another thread, which won't end well. And just simplifies API surface.
  4. What if a fiber calls pthread_exit?
  5. Can we combine __tsan_set_fiber_name with __tsan_create_fiber?
  6. I think it's better to add flags argument to __tsan_create_fiber/__tsan_switch_to_fiber to not end up with __tsan_create_fiber2 and __tsan_create_fiber3.
  7. Do we want to support nested fibers/tasks? Does GCD have nested tasks? I think TBB/OpenMP does. Won't begin/end fiber/task be a more generic API for this? E.g. fiber switching can be expressed as stop running the prev fiber (which switches back to the physical thread context), then start running new fiber.
  8. How does this play with setjmp/longjmp? In particular with fiber switching based on setjmp/longjmp? jmp_bufs is part of ThreadState, won't we always try to do longjmp from a wrong ThreadState in such case?
  9. Looking at 1, 2, 3, 7 and 8, it seems that we may want to switch some smaller part (subset) of ThreadState. E.g. introduce TaskContext that contains only the part that we need to switch on fiber switches and then have:

ThreadState -> TaskContext (of physical thread, always linked) -> TaskContext (current fiber context)
or:
ThreadState -> TaskContext (of physical thread, always linked) -> TaskContext (of a task) -> TaskContext (of a nested task)
E.g. clock can stay in ThreadState (if we want to synchronize them) which will solve the performance problem, jmp_bufs can stay which will solve part of setjmp problem. This may resolve the hacky_call problem as well.

dvyukov added inline comments.Dec 3 2018, 8:24 AM
lib/tsan/rtl/tsan_rtl_thread.cc
62

args->thr is parent (current) thread (?), so why can't we use hacky call?

115

I think we should execute start on the fiber itself. This is what we do for normal threads. This should resolve the hacky call problem (right?) and also prevent fibers from breaking in future.
We can do this either by briefly switching to the fiber context and back when we create a new fiber. Or, start it lazily on first switch.

138

Fibers created detached. How does it happen that fibers affect this?

yuri marked 3 inline comments as done.EditedDec 10 2018, 4:46 AM

Hi Dmitry,
I will try to answer your questions

  1. Do we want to synchronize fibers on switch? If fibers use TLS data, or cooperative scheduling (e.g. mutex unlock directly schedules the next critical section owner), or pass some data from prev fiber to next fiber, in all these cases not synchronizing fibers on switch will lead to false positives. As far as I remember in my old experiments I concluded that it's almost impossible to get rid of false positives if tasks/fibers are not synchronized on switch.

In the case of a single-threaded scheduler, it is worth synchronizing fibers on each switch. Currently I do it in client code using __tsan_release()/__tsan_acquire(), but it is possible to add a flag to __tsan_switch_to_fiber() that does it.

In the case of a multithreaded scheduler, the client probably has its own synchronization primitives (for example, a custom mutex), and it is the client's responsibility to call the corresponding TSan functions.
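
A minimal sketch of that manual synchronization for a single-threaded scheduler (the Fiber type and the low-level switch are hypothetical; the __tsan_* functions are the ones mentioned above, declared here by hand, using the two-argument switch form a later revision introduces):

  // Hypothetical scheduler code; only the __tsan_* calls are real.
  extern "C" void __tsan_acquire(void *addr);
  extern "C" void __tsan_release(void *addr);
  extern "C" void __tsan_switch_to_fiber(void *fiber, unsigned flags);

  struct Fiber { void *tsan_fiber; /* saved machine context, stack, ... */ };

  void RealContextSwitch(Fiber *next);  // hypothetical: swapcontext/longjmp/asm

  static char g_sched_sync;  // one sync object shared by all fibers of the scheduler

  void SwitchTo(Fiber *next) {
    __tsan_release(&g_sched_sync);                // publish this fiber's work
    __tsan_switch_to_fiber(next->tsan_fiber, 0);
    RealContextSwitch(next);                      // does not return until resumed
    __tsan_acquire(&g_sched_sync);                // on resume, observe others' work
  }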

  2. This introduces slowdown into the hottest runtime functions (branch and indirection in cur_thread).

I think check_analyze.sh test will fail. Kinda hard to say, because it's broken already at the moment, but this perturbs runtime functions enough:

I have checked parts of the patch separately and found:

  • Part of patch related to HACKY_CALL does not affect code of critical functions and benchmark results
  • Part of patch related to proc() does not affect code of critical functions and benchmark results
  • The only part that affects critical functions is the implementation of cur_thread(). I tried to minimize its effect in the new version of the patch.
  3. I think instead of __tsan_get_current_fiber we could use pointer 0. It prevents things like switching to context of another thread, which won't end well. And just simplifies API surface.

For us the possibility to use the context of another thread is a required feature. We use it to call into a library which uses fibers from external fiber-unaware code.
It is on par with the Windows API for fibers, ConvertThreadToFiber().

  4. What if a fiber calls pthread_exit?

I think it's enough to add a check and abort until someone really needs support for it.

  5. Can we combine __tsan_set_fiber_name with __tsan_create_fiber?

That would not cover our requirements. It is convenient to set the fiber name from the fiber itself and even change it sometimes, analogously to pthread_setname_np().

  6. I think it's better to add flags argument to __tsan_create_fiber/__tsan_switch_to_fiber to not end up with __tsan_create_fiber2 and __tsan_create_fiber3.

OK

  7. Do we want to support nested fibers/tasks? Does GCD have nested tasks? I think TBB/OpenMP does. Won't begin/end fiber/task be a more generic API for this? E.g. fiber switching can be expressed as stop running the prev fiber (which switches back to the physical thread context), then start running new fiber.

GCD has no nested tasks. GCD tasks can start other tasks, but can't switch or wait.

To me the current API looks more generic than what you suggest.
Also, switching to the physical thread context may be problematic in some cases (see the answer to question 3).

  8. How does this play with setjmp/longjmp? In particular with fiber switching based on setjmp/longjmp? jmp_bufs is part of ThreadState, won't we always try to do longjmp from a wrong ThreadState in such case?

My idea is that when setjmp/longjmp is used for a fiber switch, the client should disable interceptors before the call and enable them after it. This procedure could be simplified by a flag of __tsan_switch_to_fiber().

  9. Looking at 1, 2, 3, 7 and 8, it seems that we may want to switch some smaller part (subset) of ThreadState. E.g. introduce TaskContext that contains only the part that we need to switch on fiber switches and then have:

ThreadState -> TaskContext (of physical thread, always linked) -> TaskContext (current fiber context)
or:
ThreadState -> TaskContext (of physical thread, always linked) -> TaskContext (of a task) -> TaskContext (of a nested task)
E.g. clock can stay in ThreadState (if we want to synchronize them) which will solve the performance problem, jmp_bufs can stay which will solve part of setjmp problem. This may resolve the hacky_call problem as well.

I prefer not to do it:

  • There is already similar separation ThreadState/Process. If something is not needed in ThreadState, it can be moved to Process.
  • In case of single-threaded scheduler the main thing to switch is shadow stack. In case of multithreaded scheduler nearly everything should be switched. I do not think it is good idea to have two different implementations for these cases.
  • I tried to be on-par with Go sanitizer, which uses ThreadState for each goroutine and single Process per physical thread
lib/tsan/rtl/tsan_rtl_thread.cc
62

You are right; only the TraceAddEvent() call inside OnStarted() is a problem.

115

Start can indeed be executed on the first call to __tsan_switch_to_fiber(), but this solution is not suitable for destroying a fiber, because it is normal for a fiber to cycle in an infinite loop and be destroyed externally.
Do you think it is safe to temporarily switch to the fiber context for calling ThreadStart and/or ThreadFinish without actually switching to the fiber stack?

138

OK, I missed this

dvyukov added inline comments.Dec 10 2018, 5:05 AM
lib/tsan/rtl/tsan_rtl_thread.cc
115

Do we use trace in ThreadFinish?
If not, then it's not a problem. If yes, then switching to the fiber looks fine to me and better than changing non-local code invariants.

If we do this, do we still need any hacky call changes?

yuri marked an inline comment as done.Dec 10 2018, 5:11 AM
yuri added inline comments.
lib/tsan/rtl/tsan_rtl_thread.cc
115

There is no trace in ThreadFinish for a detached thread, but it uses proc(), so it is better to switch temporarily.
I think the hacky call changes can be reverted. I will check it.

yuri updated this revision to Diff 177490.Dec 10 2018, 5:39 AM
  • Temporarily switch into the fiber context for start and stop instead of avoiding HACKY_CALL
  • Minimized performance impact of the cur_thread() change
  • Added (currently unused) flags to __tsan_create_fiber() and __tsan_switch_to_fiber()

FTR, Lingfeng did prototype change to qemu that uses these annotations:
https://android-review.googlesource.com/c/platform/external/qemu/+/844675

741g added a subscriber: 741g.Dec 10 2018, 8:34 AM

FTR, Lingfeng did prototype change to qemu that uses these annotations:
https://android-review.googlesource.com/c/platform/external/qemu/+/844675

Thanks for the mention, Dmitry. yuri, that change shows the changes needed to support TSan with QEMU using the patch set before the latest upload today. We've tried it and it seems to work with the full emulator, but there are a lot of false positive messages, and we think it would be possible to not require annotations in QEMU if interceptors for the ucontext API were added here and a refactoring were done to make sure jmpbufs did not get changed on all fiber switches.

The problem is that QEMU has this way of implementing coroutines (a simplified sketch follows the three lists below):

Calling thread, qemu_coroutine_new:

  • getcontext, makecontext with coroutine_trampoline func
  • sigsetjmp
  • swapcontext (in same thread)

In the new context, in coroutine_trampoline:

  • sigsetjmp
  • siglongjmp back to qemu_coroutine_new (not swapcontext)
  • qemu_coroutine_new finishes

Later on, in the QEMU code when switching to that coroutine, in qemu_coroutine_switch:

  • sigsetjmp
  • siglongjmp to coroutine_trampoline in the other branch after the previous sigsetjmp there, run co->entry
  • qemu_coroutine_switch itself is called to switch back to the "caller" coroutine (since they can be nested I suppose)
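
A heavily simplified sketch of that control flow (this is not QEMU's actual code: argument passing, stack sizing and coroutine termination are glossed over, and the global is only for illustration):

  #include <setjmp.h>
  #include <stdlib.h>
  #include <ucontext.h>

  struct Coroutine {
    ucontext_t uc;          // used only for the very first switch
    sigjmp_buf env;         // resume point inside coroutine_trampoline
    sigjmp_buf caller_env;  // where to jump back to
    void (*entry)(void *);
    void *arg;
  };

  static Coroutine *g_current;  // stands in for QEMU's per-thread current coroutine

  static void coroutine_trampoline(void) {
    Coroutine *co = g_current;
    if (!sigsetjmp(co->env, 0))        // save the resume point...
      siglongjmp(co->caller_env, 1);   // ...and jump straight back to coroutine_new
    co->entry(co->arg);                // resumed later by coroutine_switch
  }

  void coroutine_new(Coroutine *co, void (*entry)(void *), void *arg) {
    co->entry = entry;
    co->arg = arg;
    getcontext(&co->uc);
    co->uc.uc_stack.ss_sp = malloc(64 * 1024);
    co->uc.uc_stack.ss_size = 64 * 1024;
    co->uc.uc_link = NULL;
    makecontext(&co->uc, coroutine_trampoline, 0);
    g_current = co;
    ucontext_t self;
    if (!sigsetjmp(co->caller_env, 0))
      swapcontext(&self, &co->uc);     // one-time switch onto the new stack
  }

  void coroutine_switch(Coroutine *co) {
    g_current = co;
    if (!sigsetjmp(co->caller_env, 0))
      siglongjmp(co->env, 1);          // every later switch is a siglongjmp
  }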

The current approach suggests inserting fiber creation/switching around getcontext and swapcontext. However, since swapcontext is used in such conjunction with longjmp, this creates an issue where one also needs to add fiber switch calls around longjmp, which we didn't have to do before.

That's because the current way of switching fibers identifies fibers with the complete state of a cur_thread, so fiber switch + swapcontext may also unintentionally exchange the jmpbufs as well. At least for this use case, it looks like we need to separate jmp bufs from fiber switches, and generally separate fibers from physical threads, because they differ at least in the set of jmpbufs that must be preserved.

  1. Physical thread switch: jmpbufs are not preserved
  2. Fiber switch on the same physical thread. jmpbufs are preserved.

We also found that there were a lot of false positive warnings generated by QEMU due to switching fibers on the same physical thread + updating common state. TSAN should be aware of fiber switches that are restricted to one physical thread and consider the operations implicitly synchronized. This is another reason to separate their state.

To summarize:

The set of jumpbufs should stay invariant when changing "fibers" via setjmp/longjmp.

The set of jmpbufs is not necessarily invariant under swapcontext, because swapcontext can be done across multiple physical threads, while setjmp/longjmp must remain on one thread.

However, qemu runs setjmp -> swapcontext -> longjmp all in one thread; the physical thread should be tracked for those calls and it should allow access to the jmpbuf set.

For example, if we swapcontext in another physical thread we cannot use the same jmpbufs, but if we swapcontext in one thread it should use the current jmpbufs. This should allow that logic to work for QEMU at least, and it feels like it should work for all possible other uses (?)

On fiber API versus intercepting swapcontext/makecontext: Maybe we can have both.

Lingfeng, thanks for extensive feedback.

This should allow that logic to work for QEMU at least, and it feels like it should work for all possible other uses (?)

I am not sure there are no programs that use setjmp/longjmp to switch fibers across threads. Yes, the man page says the behavior is undefined. But strictly speaking, longjmp shouldn't be used for fiber switching either (it's solely for jumping upwards on the same stack). So I feel people are just relying on implementation details here (let's test it, works, ok, let's use it).

You said that QEMU uses sigsetjmp/siglongjmp for coroutine switches. How does a coroutine always stay on the same thread then? Say, if they have a thread pool servicing all coroutines, I would expect that coroutines are randomly intermixed across threads. So it can happen that we setjmp on one thread, but then reschedule the coroutine with longjmp on another thread.

If that's the case, then it's quite unpleasant b/c it means we need a global list of jmpbufs. And that won't work with the current jmpbuf garbage collection logic and sp level checks. So we will probably need to understand that this particular jmpbuf is a fiber and then handle it specially.

Hi Dmitry,
I will try to answer your questions

  1. Do we want to synchronize fibers on switch? If fibers use TLS data, or cooperative scheduling (e.g. mutex unlock directly schedules the next critical section owner), or pass some data from prev fiber to next fiber, in all these cases not synchronizing fibers on switch will lead to false positives. As far as I remember in my old experiments I concluded that it's almost impossible to get rid of false positives if tasks/fibers are not synchronized on switch.

In the case of a single-threaded scheduler, it is worth synchronizing fibers on each switch. Currently I do it in client code using __tsan_release()/__tsan_acquire(), but it is possible to add a flag to __tsan_switch_to_fiber() that does it.

In the case of a multithreaded scheduler, the client probably has its own synchronization primitives (for example, a custom mutex), and it is the client's responsibility to call the corresponding TSan functions.

I think that synchronization will be exactly in the wrong place.
Consider: a thread pushes a fiber for execution on a thread pool. It locks a mutex and adds the fiber onto the run queue. Then a thread pool thread dequeues the fiber from the run queue under the same mutex. Now the first thread and the thread pool thread are synchronized. Now the thread pool thread switches to the fiber context; no synchronization happens, so the fiber is _not_ synchronized with the submitting thread. This means that any passed data will cause false positives.
Even if the fiber is just created, in the above scheme even reading the fiber entry function arguments (and even just writing to the stack) will cause false positives.

If the fiber switch synchronizes the parent thread and the fiber, then we will get the desired transitive synchronization.

Looking at the race reports produced by the qemu prototype, it seems there are actually all kinds of these false reports. There are false races on TLS storage. There are races on the fiber start function. There are numerous races on arguments passed to fibers.

You said that for your case you need the synchronization, and it seems that it's also needed for all other cases too. I think we need to do it. People will get the annotations wrong, or not think about acquire/release at all, and then come to the mailing list with questions about all these races. If it proves to be needed, we can always add an opt-out of synchronization with a new flag for the annotations.

  2. This introduces slowdown into the hottest runtime functions (branch and indirection in cur_thread).

I think check_analyze.sh test will fail. Kinda hard to say, because it's broken already at the moment, but this perturbs runtime functions enough:

I have checked parts of the patch separately and found:

  • Part of patch related to HACKY_CALL does not affect code of critical functions and benchmark results
  • Part of patch related to proc() does not affect code of critical functions and benchmark results
  • The only part that affects critical functions is the implementation of cur_thread(). I tried to minimize its effect in the new version of the patch.

Yes, cur_thread() is the problem. It's used in all the hottest functions.
As far as I see the current code still contains an indirection in cur_thread(). We need to think of a way to not affect hot paths.

  9. Looking at 1, 2, 3, 7 and 8, it seems that we may want to switch some smaller part (subset) of ThreadState. E.g. introduce TaskContext that contains only the part that we need to switch on fiber switches and then have:

ThreadState -> TaskContext (of physical thread, always linked) -> TaskContext (current fiber context)
or:
ThreadState -> TaskContext (of physical thread, always linked) -> TaskContext (of a task) -> TaskContext (of a nested task)
E.g. clock can stay in ThreadState (if we want to synchronize them) which will solve the performance problem, jmp_bufs can stay which will solve part of setjmp problem. This may resolve the hacky_call problem as well.

I prefer not to do it:

  • There is already similar separation ThreadState/Process. If something is not needed in ThreadState, it can be moved to Process.

You mean Processor?
Processor is completely different thing, it does not have any relation to user-visible state, it's merely a helper and implementation detail that makes internal machinery more efficient. ThreadState is a user-visible entity. Nothing can be moved from ThreadState to Processor (except for stuff that is meant to be in Processor, but wasn't moved with the introduction of Processor, but that's a different story, it does not relate to what we discuss here).

  • In case of single-threaded scheduler the main thing to switch is shadow stack. In case of multithreaded scheduler nearly everything should be switched. I do not think it is good idea to have two different implementations for these cases.

Not if we always synchronize on fiber switch.
If we synchronize, then the clock does not need to be switched. And the clock is exactly the thing that we need to embed in TLS to not slow down hot paths.

  • I tried to be on-par with Go sanitizer, which uses ThreadState for each goroutine and single Process per physical thread

Go is different. Processor is not user visible, ThreadState is.
Go model is exactly the same as we have for C/C++ without this change: both have only 1 user-visible entity: thread. With this change C/C++ has 2 user-visible entities: thread and fiber.

  3. I think instead of __tsan_get_current_fiber we could use pointer 0. It prevents things like switching to context of another thread, which won't end well. And just simplifies API surface.

For us the possibility to use the context of another thread is a required feature. We use it to call into a library which uses fibers from external fiber-unaware code.
It is on par with the Windows API for fibers, ConvertThreadToFiber().

Tell me more.
You have a context for all fibers; __tsan_create_fiber returns it.
__tsan_get_current_fiber is only useful for the physical thread. Do you use
__tsan_switch_to_fiber to switch from a fiber to a non-current physical thread? I don't see how this can work. Or what do you do?

  4. What if a fiber calls pthread_exit?

I think it's enough to add a check and abort until someone really needs support for it.

Let's add a check and a test then. These things tend to lead to very obscure failure modes that take days to debug on large user programs.

  5. Can we combine __tsan_set_fiber_name with __tsan_create_fiber?

That would not cover our requirements. It is convenient to set the fiber name from the fiber itself and even change it sometimes, analogously to pthread_setname_np().

I see. Let's leave it as separate function. I guess it's more flexible and will cover all possible use cases.

  7. Do we want to support nested fibers/tasks? Does GCD have nested tasks? I think TBB/OpenMP does. Won't begin/end fiber/task be a more generic API for this? E.g. fiber switching can be expressed as stop running the prev fiber (which switches back to the physical thread context), then start running new fiber.

GCD has no nested tasks. GCD tasks can start other tasks, but can't switch or wait.

To me the current API looks more generic than what you suggest.
Also, switching to the physical thread context may be problematic in some cases (see the answer to question 3).

Let's leave tasks aside for now. It's quite complex without tasks already.

  8. How does this play with setjmp/longjmp? In particular with fiber switching based on setjmp/longjmp? jmp_bufs is part of ThreadState, won't we always try to do longjmp from a wrong ThreadState in such case?

My idea is that when setjmp/longjmp is used for a fiber switch, the client should disable interceptors before the call and enable them after it.

What do you mean by "disable interceptors"? Do we have such thing?

This procedure could be simplified by a flag of __tsan_switch_to_fiber().

How? setjmp/longjmp happen outside of __tsan_switch_to_fiber.

Thinking of how we can eliminate the indirection in cur_thread...

We could avoid allocating a tid for a fiber at all (tid space is expensive) if we recorded fiber switch events in the trace. Then all fibers running on a thread could reuse its tid, and we could restore fiber identity for memory accesses from the trace. But it gets pretty complicated at this point.

I think we should split ThreadState into a Thread part (which can still be called ThreadState) and a Fiber part. ThreadState should contain at least fast_state, fast_synch_epoch, clock and maybe some of the int's and bool's (as they are easy to copy) and pointers to the current Fiber and to the physical thread's Fiber. The Fiber part will contain everything else. A fiber switch will switch the Fiber pointer in ThreadState and copy out/copy in/update fast_state, fast_synch_epoch and clock.

This will eliminate the indirection on hot paths and will also open a path to nested tasks (as a Fiber can contain a pointer to a nested Fiber).
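
A rough sketch of the proposed split (illustrative only, not the actual patch; FastState and ThreadClock stand in for the existing tsan types of those names):

  // Illustrative only; not the actual patch.
  typedef unsigned long long u64;
  struct FastState { u64 raw; };            // placeholder for tsan's FastState
  struct ThreadClock { /* vector clock */ };

  struct Fiber {
    // Everything per-fiber that can be switched lazily:
    // shadow stack, jmp_bufs, ignore state, ...
    Fiber *nested;          // would leave room for nested tasks later
  };

  struct ThreadState {
    FastState fast_state;   // hot fields stay reachable without an extra indirection
    u64 fast_synch_epoch;
    ThreadClock clock;
    Fiber *current_fiber;   // switched by __tsan_switch_to_fiber()
    Fiber *thread_fiber;    // the physical thread's own Fiber
  };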

dvyukov added inline comments.Dec 11 2018, 6:28 AM
lib/tsan/rtl/tsan_interface.cc
141

The bulk of this logic should go into tsan_rtl_thread.cc.
This file is not meant to contain any significant logic; it's only the interface part.
Please add a set of runtime functions following the current naming convention (e.g. FiberCreate, FiberSwitch, etc.), and forward to these functions.

lib/tsan/rtl/tsan_rtl.h
472

This looks even more expensive than before.
Previously we had just 1 pointer indirection. Now we also add a base to it, so it's another load and another add.

lib/tsan/rtl/tsan_rtl_thread.cc
249

Let's not abuse an existing variable for this. I am not sure that no OS passes 0 here today, or that in the future somebody won't decide to pass 0 because they don't have anything meaningful to pass as os_id. Let's add another flag for this.

test/tsan/fiber_asm.cc
62

We save the current fiber, which is the main thread in this test. But if we ever try to switch to it on another thread, everything will explode in a bad way, right?

test/tsan/fiber_simple.cc
17

If we move this 1 line below, we will get a race on orig_fiber, right? If yes, this is another argument towards synchronizing all fiber switches.

741g added a comment.Dec 11 2018, 9:46 AM

Lingfeng, thanks for extensive feedback.

This should allow that logic to work for QEMU at least, and it feels like it should work for all possible other uses (?)

I am not sure there are no programs that use setjmp/longjmp to switch fibers across threads. Yes, the man page says the behavior is undefined. But strictly speaking, longjmp shouldn't be used for fiber switching either (it's solely for jumping upwards on the same stack). So I feel people are just relying on implementation details here (let's test it, works, ok, let's use it).

You said that QEMU uses sigsetjmp/siglongjmp for coroutine switches. How does a coroutine always stay on the same thread then? Say, if they have a thread pool servicing all coroutines, I would expect that coroutines are randomly intermixed across threads. So it can happen that we setjmp on one thread, but then reschedule the coroutine with longjmp on another thread.

Thanks for pointing that out. I have to check more closely, but I haven't observed migration of fibers across physical threads at all in qemu. The tendency is to both create and switch the coroutine on a dedicated thread. Swapcontext is done to a trampoline function, to which one can return again later on. Kind of like a fiber pool, where reuse of a fiber means longjmping to the trampoline instead of another swapcontext.

I don't think there is a thread pool that mixes around the fibers. I guess this means we would have to annotate in the more general use case, but for QEMU, if we make the assumption that setjmp/longjmp follow the spec, while only allowing physical thread switches in swapcontext, we should be OK (and physical thread switching in swapcontext is something QEMU does not do).

Lingfeng, thanks for extensive feedback.

This should allow that logic to work for QEMU at least, and it feels like it should work for all possible other uses (?)

I am not sure there are no programs that use setjmp/longjmp to switch fibers across threads. Yes, the man page says the behavior is undefined. But strictly speaking, longjmp shouldn't be used for fiber switching either (it's solely for jumping upwards on the same stack). So I feel people are just relying on implementation details here (let's test it, works, ok, let's use it).

You said that QEMU uses sigsetjmp/siglongjmp for coroutine switches. How does a coroutine always stay on the same thread then? Say, if they have a thread pool servicing all coroutines, I would expect that coroutines are randomly intermixed across threads. So it can happen that we setjmp on one thread, but then reschedule the coroutine with longjmp on another thread.

Thanks for pointing that out. I have to check more closely, but I haven't observed migration of fibers across physical threads at all in qemu. The tendency is to both create and switch the coroutine on a dedicated thread. Swapcontext is done to a trampoline function, to which one can return again later on. Kind of like a fiber pool, where reuse of a fiber means longjmping to the trampoline instead of another swapcontext.

I don't think there is a thread pool that mixes around the fibers. I guess this means we would have to annotate in the more general use case, but for QEMU, if we make the assumption that setjmp/longjmp follow the spec, while only allowing physical thread switches in swapcontext, we should be OK (and physical thread switching in swapcontext is something QEMU does not do).

Makes sense.

If somebody wants to do longjmp to switch fibers across threads, we will need some additional support in tsan for that. Let's stick with Yuri's and your use cases for now.

yuri updated this revision to Diff 178459.Dec 17 2018, 7:10 AM
  • By default sync fibers with each other on switch
  • Faster cur_thread(), no visible difference in benchmark results
  • Moved implementation to tsan_rtl_thread.cc
yuri added a comment.Dec 17 2018, 7:15 AM

Hi Dmitry,
I will try to answer your questions

  1. Do we want to synchronize fibers on switch? If fibers use TLS data, or cooperative scheduling (e.g. mutex unlock directly schedules the next critical section owner), or pass some data from prev fiber to next fiber, in all these cases not synchronizing fibers on switch will lead to false positives. As far as I remember in my old experiments I concluded that it's almost impossible to get rid of false positives if tasks/fibers are not synchronized on switch.

In the case of a single-threaded scheduler, it is worth synchronizing fibers on each switch. Currently I do it in client code using __tsan_release()/__tsan_acquire(), but it is possible to add a flag to __tsan_switch_to_fiber() that does it.

In the case of a multithreaded scheduler, the client probably has its own synchronization primitives (for example, a custom mutex), and it is the client's responsibility to call the corresponding TSan functions.

I think that synchronization will be exactly in the wrong place.
Consider: a thread pushes a fiber for execution on a thread pool. It locks a mutex and adds the fiber onto the run queue. Then a thread pool thread dequeues the fiber from the run queue under the same mutex. Now the first thread and the thread pool thread are synchronized. Now the thread pool thread switches to the fiber context; no synchronization happens, so the fiber is _not_ synchronized with the submitting thread. This means that any passed data will cause false positives.
Even if the fiber is just created, in the above scheme even reading the fiber entry function arguments (and even just writing to the stack) will cause false positives.

If the fiber switch synchronizes the parent thread and the fiber, then we will get the desired transitive synchronization.

Looking at the race reports produced by the qemu prototype, it seems there are actually all kinds of these false reports. There are false races on TLS storage. There are races on the fiber start function. There are numerous races on arguments passed to fibers.

You said that for your case you need the synchronization, and it seems that it's also needed for all other cases too. I think we need to do it. People will get the annotations wrong, or not think about acquire/release at all, and then come to the mailing list with questions about all these races. If it proves to be needed, we can always add an opt-out of synchronization with a new flag for the annotations.

I added default synchronization and a flag to opt out. It would be great if someone could check this version with QEMU.

yuri added a comment.Dec 17 2018, 7:22 AM
  2. This introduces slowdown into the hottest runtime functions (branch and indirection in cur_thread).

I think check_analyze.sh test will fail. Kinda hard to say, because it's broken already at the moment, but this perturbs runtime functions enough:

I have checked parts of the patch separately and found:

  • Part of patch related to HACKY_CALL does not affect code of critical functions and benchmark results
  • Part of patch related to proc() does not affect code of critical functions and benchmark results
  • The only part that affects critical functions is the implementation of cur_thread(). I tried to minimize its effect in the new version of the patch.

Yes, cur_thread() is the problem. It's used in all the hottest functions.
As far as I see the current code still contains an indirection in cur_thread(). We need to think of a way to not affect hot paths.

The new version should be as fast as the original code.

  9. Looking at 1, 2, 3, 7 and 8, it seems that we may want to switch some smaller part (subset) of ThreadState. E.g. introduce TaskContext that contains only the part that we need to switch on fiber switches and then have:

ThreadState -> TaskContext (of physical thread, always linked) -> TaskContext (current fiber context)
or:
ThreadState -> TaskContext (of physical thread, always linked) -> TaskContext (of a task) -> TaskContext (of a nested task)
E.g. clock can stay in ThreadState (if we want to synchronize them) which will solve the performance problem, jmp_bufs can stay which will solve part of setjmp problem. This may resolve the hacky_call problem as well.

IMHO it is hardly possible: function entry/exit uses the shadow stack, and the memory access functions use the clock. Both structures are too big to be copied on each fiber switch.

I prefer not to do it:

  • There is already similar separation ThreadState/Process. If something is not needed in ThreadState, it can be moved to Process.

You mean Processor?
Processor is completely different thing, it does not have any relation to user-visible state, it's merely a helper and implementation detail that makes internal machinery more efficient. ThreadState is a user-visible entity. Nothing can be moved from ThreadState to Processor (except for stuff that is meant to be in Processor, but wasn't moved with the introduction of Processor, but that's a different story, it does not relate to what we discuss here).

  • In case of single-threaded scheduler the main thing to switch is shadow stack. In case of multithreaded scheduler nearly everything should be switched. I do not think it is good idea to have two different implementations for these cases.

Not if we always synchronize on fiber switch.
If we synchronize, then the clock does not need to be switched. And the clock is exactly the thing that we need to embed in TLS to not slow down hot paths.

  • I tried to be on-par with Go sanitizer, which uses ThreadState for each goroutine and single Process per physical thread

Go is different. Processor is not user visible, ThreadState is.
Go model is exactly the same as we have for C/C++ without this change: both have only 1 user-visible entity: thread. With this change C/C++ has 2 user-visible entities: thread and fiber.

I can't agree with this. Only the fiber is visible to the user. Of course, it is the same as the thread until the user switches it.

yuri added a comment.EditedDec 17 2018, 7:38 AM
  3. I think instead of __tsan_get_current_fiber we could use pointer 0. It prevents things like switching to context of another thread, which won't end well. And just simplifies API surface.

For us the possibility to use the context of another thread is a required feature. We use it to call into a library which uses fibers from external fiber-unaware code.
It is on par with the Windows API for fibers, ConvertThreadToFiber().

Tell me more.
You have a context for all fibers; __tsan_create_fiber returns it.
__tsan_get_current_fiber is only useful for the physical thread. Do you use
__tsan_switch_to_fiber to switch from a fiber to a non-current physical thread? I don't see how this can work. Or what do you do?

In a standalone thread we create a temporary context and switch into it. Then the original thread context is posted into the event loop thread, where it runs for some time together with other fibers. At some point the code decides that it should migrate back to the original thread. The context is removed from the list of fibers and the original thread (running the temporary context) is signalled. The thread switches into its own context and destroys the temporary context.

This trick significantly simplifies the implementation of bindings for a library that uses fibers internally and improves debuggability because stack traces are preserved across the switch.

  4. What if a fiber calls pthread_exit?

I think it's enough to add a check and abort until someone really needs support for it.

Let's add a check and a test then. These things tend to lead to very obscure failure modes that take days to debug on large user programs.

Currently there is no interceptor for pthread_exit(). Do you suggest adding it?

  5. Can we combine __tsan_set_fiber_name with __tsan_create_fiber?

That would not cover our requirements. It is convenient to set the fiber name from the fiber itself and even change it sometimes, analogously to pthread_setname_np().

I see. Let's leave it as separate function. I guess it's more flexible and will cover all possible use cases.

  7. Do we want to support nested fibers/tasks? Does GCD have nested tasks? I think TBB/OpenMP does. Won't begin/end fiber/task be a more generic API for this? E.g. fiber switching can be expressed as stop running the prev fiber (which switches back to the physical thread context), then start running new fiber.

GCD has no nested tasks. GCD tasks can start other tasks, but can't switch or wait.

To me the current API looks more generic than what you suggest.
Also, switching to the physical thread context may be problematic in some cases (see the answer to question 3).

Let's leave tasks aside for now. It's quite complex without tasks already.

  8. How does this play with setjmp/longjmp? In particular with fiber switching based on setjmp/longjmp? jmp_bufs is part of ThreadState, won't we always try to do longjmp from a wrong ThreadState in such case?

My idea is that when setjmp/longjmp is used for a fiber switch, the client should disable interceptors before the call and enable them after it.

What do you mean by "disable interceptors"? Do we have such thing?

This procedure could be simplified by a flag of __tsan_switch_to_fiber().

How? setjmp/longjmp happen outside of __tsan_switch_to_fiber.

Isn't it enough to place __tsan_ignore_thread_begin()/__tsan_ignore_thread_end() around setjmp/longjmp calls?

dvyukov added inline comments.Dec 18 2018, 7:18 AM
lib/tsan/rtl/tsan_rtl_thread.cc
249

Still holds.

I can't agree with this. Only the fiber is visible to the user. Of course, it is the same as the thread until the user switches it.

Whatever we call them, Processor is for different things. It must not hold anything related to user state.

In a standalone thread we create a temporary context and switch into it. Then the original thread context is posted into the event loop thread, where it runs for some time together with other fibers.

Oh, I see. I guess it can have some rough edges if used incorrectly, but also probably can work if used carefully.
OK, let's leave __tsan_get_current_fiber alone.

Currently there is no interceptor for pthread_exit(). Do you suggest adding it?

Yes.

Isn't it enough to place __tsan_ignore_thread_begin()/__tsan_ignore_thread_end() around setjmp/longjmp calls?

These only ignore memory accesses, they don't affect interceptor operation as far as I see.

I don't completely follow the logic behind cur_thread/cur_thread_fast/cur_thread1 and how it does not introduce a slowdown.
cur_thread_fast still includes an indirection, so it looks exactly the same as the first version.
cur_thread_fast does not check for NULL, so I would expect it to be used only in a few carefully chosen functions. But it's used in most entry functions. Which makes me wonder: when can cur_thread_fast not be used, if it can be used in all these places?
It seems that you use cur_thread only to lazily initialize cur_thread1 in Initialize or ThreadStart. But then the question is: why don't we explicitly initialize cur_thread1 in these functions, and then just have a single cur_thread function as before with no two subtly different accessors?

dvyukov added inline comments.Dec 18 2018, 7:32 AM
lib/tsan/rtl/tsan_rtl.h
880

It seems it should be SwtchFiberFlags flags. Or why do we need SwtchFiberFlags?

lib/tsan/rtl/tsan_rtl_thread.cc
417

This code is an abstract core and must not use cur_thread.
Pass ThreadState *thr, uptr pc as first args to all these functions as all runtime functions do.

I added default synchronization and a flag to opt out.

Can you switch to it in your codebase? I would expect that you need all (or almost all) of this synchronization anyway.

In a standalone thread we create a temporary context and switch into it. Then the original thread context is posted into the event loop thread, where it runs for some time together with other fibers.

Please add such a test (or we will break it in the future).

IMHO it is hardly possible: function entry/exit uses the shadow stack, and the memory access functions use the clock. Both structures are too big to be copied on each fiber switch.

This is a good question.
Shadow stack seems to use an indirection already.
Clock copying is a sequential memory copy. If handling of memory accesses/function entry/exit becomes a bit slower, then a thousand memory accesses can outweigh the cost of copying the clock.
A good question is: how much slower is the code with the indirection in cur_thread?

FTR, here is current code:

00000000004b2c50 <__tsan_read2>:
  4b2c50:       48 b8 f8 ff ff ff ff    movabs $0xffff87fffffffff8,%rax
  4b2c57:       87 ff ff 
  4b2c5a:       48 ba 00 00 00 00 00    movabs $0x40000000000,%rdx
  4b2c61:       04 00 00 
  4b2c64:       53                      push   %rbx
  4b2c65:       48 21 f8                and    %rdi,%rax
  4b2c68:       48 8b 74 24 08          mov    0x8(%rsp),%rsi
  4b2c6d:       48 31 d0                xor    %rdx,%rax
  4b2c70:       48 83 3c 85 00 00 00    cmpq   $0xffffffffffffffff,0x0(,%rax,4)
  4b2c77:       00 ff 
  4b2c79:       0f 84 9d 00 00 00       je       4b2d1c <__tsan_read2+0xcc>
  4b2c7f:       49 c7 c0 c0 04 fc ff    mov    $0xfffffffffffc04c0,%r8
  4b2c86:       64 49 8b 08             mov    %fs:(%r8),%rcx
  4b2c8a:       48 85 c9                test   %rcx,%rcx
  4b2c8d:       0f 88 89 00 00 00       js       4b2d1c <__tsan_read2+0xcc>
...
  4b2cf6:       64 f3 41 0f 7e 50 08    movq   %fs:0x8(%r8),%xmm2
...
  4b2d36:       64 49 89 10             mov    %rdx,%fs:(%r8)

Here is new code:

00000000004b8460 <__tsan_read2>:
  4b8460:       48 b8 f8 ff ff ff ff    movabs $0xffff87fffffffff8,%rax
  4b8467:       87 ff ff 
  4b846a:       48 ba 00 00 00 00 00    movabs $0x40000000000,%rdx
  4b8471:       04 00 00 
  4b8474:       53                      push   %rbx
  4b8475:       48 21 f8                and    %rdi,%rax
  4b8478:       48 8b 74 24 08          mov    0x8(%rsp),%rsi
  4b847d:       48 31 d0                xor    %rdx,%rax
  4b8480:       48 83 3c 85 00 00 00    cmpq   $0xffffffffffffffff,0x0(,%rax,4)
  4b8487:       00 ff 
  4b8489:       0f 84 9f 00 00 00       je       4b852e <__tsan_read2+0xce>
  4b848f:       48 c7 c2 f8 ff ff ff    mov    $0xfffffffffffffff8,%rdx
  4b8496:       64 4c 8b 02             mov    %fs:(%rdx),%r8
  4b849a:       49 8b 08                mov    (%r8),%rcx
  4b849d:       48 85 c9                test   %rcx,%rcx
  4b84a0:       0f 88 88 00 00 00       js       4b852e <__tsan_read2+0xce>
...
  4b8509:       f3 41 0f 7e 50 08       movq   0x8(%r8),%xmm2
...
  4b8546:       49 89 10                mov    %rdx,(%r8)

The additional indirection is "mov (%r8),%rcx".

741g added a comment.Dec 18 2018, 8:25 AM

https://android-review.googlesource.com/c/platform/external/qemu/+/844675

Latest version of patch doesn't work with QEMU anymore, at least with those annotations. Error log:

qemu_coroutine_new:181:0x7f02938cf700:0x7f029388f7c0 Start new coroutine
on_new_fiber:94:0x7f02938cf700:0x7f029388f7c0 Create new TSAN co fiber. co: 0x7b4400025940 co fiber: 0x7f0292cc0008 caller fiber: 0x7f029388f7c0
start_switch_fiber:133:0x7f02938cf700:0x7f029388f7c0 Current fiber: 0x7f029388f7c0.
start_switch_fiber:135:0x7f02938cf700:0x7f029388f7c0 Switch to fiber 0x7f0292cc0008
coroutine_trampoline:142:0x7f02938cf700:0x7f0292cc0008 Start trampoline
coroutine_trampoline:157:0x7f02938cf700:0x7f0292cc0008 Current fiber: 0x7f0292cc0008. Set co 0x7b4400025940 to env 0x7b44000259a8
start_switch_fiber:133:0x7f02938cf700:0x7f0292cc0008 Current fiber: 0x7f0292cc0008.
start_switch_fiber:135:0x7f02938cf700:0x7f0292cc0008 Switch to fiber 0x7f029388f7c0
coroutine_trampoline:164:0x7f02938cf700:0x7f029388f7c0 Jump to co 0x7b4400025940 caller fiber 0x7f029388f7c0 env 0x7b4400025940
FATAL: ThreadSanitizer CHECK failed: /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:531 "((thr->shadow_stack_pos)) >= ((buf->shadow_stack_pos))" (0x60000bbe0078, 0x60000bbe0080)

#0 __tsan::TsanCheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_rtl_report.cc:48:25 (qemu-system-x86_64+0x536168)
#1 __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/sanitizer_common/sanitizer_termination.cc:79:5 (qemu-system-x86_64+0x4bcf4f)
#2 LongJmp(__tsan::ThreadState*, unsigned long*) /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:531:7 (qemu-system-x86_64+0x4d2311)
#3 siglongjmp /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:647:3 (qemu-system-x86_64+0x4d23ce)
#4 coroutine_trampoline /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/coroutine-ucontext.c:165:9 (qemu-system-x86_64+0xb72000)
#5 <null> <null> (libc.so.6+0x43fcf)

Do you know what I'm doing wrong here?

yuri added a comment.Dec 18 2018, 8:41 AM

https://android-review.googlesource.com/c/platform/external/qemu/+/844675

Latest version of patch doesn't work with QEMU anymore, at least with those annotations. Error log:

Do you know what I'm doing wrong here?


Please check if adding __attribute__((always_inline)) for start_switch_fiber() and finish_switch_fiber() helps
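
For reference, a minimal sketch of how the fiber API from this patch can be wired into coroutine-switching code. All names below are illustrative (they do not mirror QEMU's actual start_switch_fiber/finish_switch_fiber); the key points are that __tsan_switch_to_fiber() runs immediately before the real context switch and that the helpers carrying these calls are forced inline:

#include <sanitizer/tsan_interface.h>

/* Hypothetical per-coroutine bookkeeping; QEMU's real structs differ. */
typedef struct Coroutine {
  void *tsan_fiber;   /* fiber created for this coroutine      */
  void *tsan_caller;  /* fiber to switch back to when yielding */
} Coroutine;

static inline __attribute__((always_inline)) void
annotate_switch_to_coroutine(Coroutine *co) {
  /* Remember who resumed us, tell TSan we are about to run on the
     coroutine's fiber, then do the real swapcontext/siglongjmp. */
  co->tsan_caller = __tsan_get_current_fiber();
  __tsan_switch_to_fiber(co->tsan_fiber, 0);
}

static inline __attribute__((always_inline)) void
annotate_switch_to_caller(Coroutine *co) {
  /* Called right before yielding back to the code that resumed us. */
  __tsan_switch_to_fiber(co->tsan_caller, 0);
}

static inline void coroutine_fiber_create(Coroutine *co) {
  co->tsan_fiber = __tsan_create_fiber(0);
}

static inline void coroutine_fiber_destroy(Coroutine *co) {
  __tsan_destroy_fiber(co->tsan_fiber);
}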

741g added a comment.Dec 18 2018, 8:50 AM

https://android-review.googlesource.com/c/platform/external/qemu/+/844675

Latest version of patch doesn't work with QEMU anymore, at least with those annotations. Error log:

Do you know what I'm doing wrong here?


Please check if adding __attribute__((always_inline)) for start_switch_fiber() and finish_switch_fiber() helps

That worked, thanks! Sorry, I had updated the patch with a refactoring in the meantime that probably caused the break.

741g added a comment.Dec 19 2018, 1:07 PM

QEMU status: Fewer false positives now; there are many warnings that seem more real now, mostly about variables that are updated atomically but read non-atomically. There might be other issues as well.

741g added a comment.Dec 19 2018, 1:23 PM

WARNING: ThreadSanitizer: data race (pid=3742)
Atomic write of size 1 at 0x7b0c00051620 by thread T14 (mutexes: write M1097):
  #0 __tsan_atomic8_exchange /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interface_atomic.cc:579:3 (qemu-system-x86_64+0x51d0a8)
  #1 qemu_bh_schedule /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/async.c:167:9 (qemu-system-x86_64+0xb4675e)
  #2 worker_thread /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/thread-pool.c:114:9 (qemu-system-x86_64+0xb4838c)
  #3 qemu_thread_trampoline /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/qemu-thread-posix.c:551:17 (qemu-system-x86_64+0xb4fe36)
Previous read of size 1 at 0x7b0c00051620 by thread T10 (mutexes: write M985):
  #0 aio_compute_timeout /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/async.c:198:17 (qemu-system-x86_64+0xb46868)
  #1 aio_poll /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/aio-posix.c:617:26 (qemu-system-x86_64+0xb4c38c)
  #2 blk_prw /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1247:9 (qemu-system-x86_64+0x98bbbd)
  #3 blk_pread /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1409:15 (qemu-system-x86_64+0x98b91a)
  #4 find_image_format /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:701:11 (qemu-system-x86_64+0x958068)
  #5 bdrv_open_inherit /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:2690 (qemu-system-x86_64+0x958068)
  #6 bdrv_open /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:2802:12 (qemu-system-x86_64+0x958d1f)
  #7 blk_new_open /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:375:10 (qemu-system-x86_64+0x98927a)
  #8 blockdev_init /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../blockdev.c:598:15 (qemu-system-x86_64+0x9c5700)
  #9 drive_new /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../blockdev.c:1092 (qemu-system-x86_64+0x9c5700)
  #10 drive_init(void*, QemuOpts*, Error**) /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/drive-share.cpp:366:28 (qemu-system-x86_64+0xb1e9a2)
  #11 qemu_opts_foreach /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/qemu-option.c:1106:14 (qemu-system-x86_64+0xb696e5)
  #12 android_drive_share_init /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/drive-share.cpp:530:9 (qemu-system-x86_64+0xb1e1e0)
  #13 main_impl /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:5244:13 (qemu-system-x86_64+0x54f811)
  #14 run_qemu_main /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:3349:21 (qemu-system-x86_64+0x547a02)
  #15 enter_qemu_main_loop(int, char**) /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/main.cpp:606:5 (qemu-system-x86_64+0x54427c)
  #16 MainLoopThread::run() /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android/android-emu/android/skin/qt/emulator-qt-window.h:73:13 (qemu-system-x86_64+0xc40b41)
  #17 QThreadPrivate::start(void*) /usr/local/google/home/joshuaduong/qt-build/src/qt-everywhere-src-5.11.1/qtbase/src/corelib/thread/qthread_unix.cpp:367:14 (libQt5Core.so.5+0xa7d35)

Location is heap block of size 40 at 0x7b0c00051600 allocated by thread T13:
  #0 malloc /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:667:5 (qemu-system-x86_64+0x4d246c)
  #1 g_malloc /tmp/jansene-build-temp-193494/src/glib-2.38.2/glib/gmem.c:104 (qemu-system-x86_64+0xecfdc0)
  #2 thread_pool_init_one /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/thread-pool.c:307:27 (qemu-system-x86_64+0xb47a63)
  #3 thread_pool_new /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/thread-pool.c:327 (qemu-system-x86_64+0xb47a63)
  #4 aio_get_thread_pool /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/async.c:320:28 (qemu-system-x86_64+0xb46934)
  #5 paio_submit_co /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/file-posix.c:1565:12 (qemu-system-x86_64+0xa15d23)
  #6 raw_co_prw /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/file-posix.c:1620 (qemu-system-x86_64+0xa15d23)
  #7 raw_co_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/file-posix.c:1627 (qemu-system-x86_64+0xa15d23)
  #8 bdrv_driver_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/io.c:924:16 (qemu-system-x86_64+0x8d91f7)
  #9 bdrv_aligned_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/io.c:1228 (qemu-system-x86_64+0x8d91f7)
  #10 bdrv_co_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/io.c:1324:11 (qemu-system-x86_64+0x8d8dd7)
  #11 blk_co_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1158:11 (qemu-system-x86_64+0x98b0da)
  #12 blk_read_entry /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1206:17 (qemu-system-x86_64+0x98c1e3)
  #13 coroutine_trampoline /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/coroutine-ucontext.c:173:9 (qemu-system-x86_64+0xb71c8a)
  #14 <null> <null> (libc.so.6+0x43fcf)

Mutex M1097 (0x7b3800009bd0) created at:
  #0 pthread_mutex_init /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:1187:3 (qemu-system-x86_64+0x4d4ebc)
  #1 qemu_mutex_init /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/qemu-thread-posix.c:61:11 (qemu-system-x86_64+0xb4f2a7)
  #2 thread_pool_init_one /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/thread-pool.c:308:5 (qemu-system-x86_64+0xb47a7b)
  #3 thread_pool_new /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/thread-pool.c:327 (qemu-system-x86_64+0xb47a7b)
  #4 aio_get_thread_pool /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/async.c:320:28 (qemu-system-x86_64+0xb46934)
  #5 paio_submit_co /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/file-posix.c:1565:12 (qemu-system-x86_64+0xa15d23)
  #6 raw_co_prw /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/file-posix.c:1620 (qemu-system-x86_64+0xa15d23)
  #7 raw_co_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/file-posix.c:1627 (qemu-system-x86_64+0xa15d23)
  #8 bdrv_driver_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/io.c:924:16 (qemu-system-x86_64+0x8d91f7)
  #9 bdrv_aligned_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/io.c:1228 (qemu-system-x86_64+0x8d91f7)
  #10 bdrv_co_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/io.c:1324:11 (qemu-system-x86_64+0x8d8dd7)
  #11 blk_co_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1158:11 (qemu-system-x86_64+0x98b0da)
  #12 blk_read_entry /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1206:17 (qemu-system-x86_64+0x98c1e3)
  #13 coroutine_trampoline /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/coroutine-ucontext.c:173:9 (qemu-system-x86_64+0xb71c8a)
  #14 <null> <null> (libc.so.6+0x43fcf)

Mutex M985 (0x000003bb0fb8) created at:
  #0 pthread_mutex_init /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:1187:3 (qemu-system-x86_64+0x4d4ebc)
  #1 qemu_mutex_init /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/qemu-thread-posix.c:61:11 (qemu-system-x86_64+0xb4f2a7)
  #2 qemu_init_cpu_loop /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../cpus.c:1123:5 (qemu-system-x86_64+0x5ddb0c)
  #3 main_impl /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:3437:5 (qemu-system-x86_64+0x547ae0)
  #4 run_qemu_main /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:3349:21 (qemu-system-x86_64+0x547a02)
  #5 enter_qemu_main_loop(int, char**) /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/main.cpp:606:5 (qemu-system-x86_64+0x54427c)
  #6 MainLoopThread::run() /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android/android-emu/android/skin/qt/emulator-qt-window.h:73:13 (qemu-system-x86_64+0xc40b41)
  #7 QThreadPrivate::start(void*) /usr/local/google/home/joshuaduong/qt-build/src/qt-everywhere-src-5.11.1/qtbase/src/corelib/thread/qthread_unix.cpp:367:14 (libQt5Core.so.5+0xa7d35)

Thread T14 (tid=3886, running) created by thread T10 at:
  #0 pthread_create /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:968:3 (qemu-system-x86_64+0x4d3cda)
  #1 qemu_thread_create /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/qemu-thread-posix.c:591:11 (qemu-system-x86_64+0xb4fcef)
  #2 do_spawn_thread /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/thread-pool.c:135:5 (qemu-system-x86_64+0xb48072)
  #3 spawn_thread_bh_fn /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/thread-pool.c:143 (qemu-system-x86_64+0xb48072)
  #4 aio_bh_call /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/async.c:90:5 (qemu-system-x86_64+0xb46616)
  #5 aio_bh_poll /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/async.c:118 (qemu-system-x86_64+0xb46616)
  #6 aio_poll /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/aio-posix.c:706:17 (qemu-system-x86_64+0xb4cf45)
  #7 blk_prw /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1247:9 (qemu-system-x86_64+0x98bbbd)
  #8 blk_pread /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1409:15 (qemu-system-x86_64+0x98b91a)
  #9 find_image_format /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:701:11 (qemu-system-x86_64+0x958068)
  #10 bdrv_open_inherit /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:2690 (qemu-system-x86_64+0x958068)
  #11 bdrv_open /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:2802:12 (qemu-system-x86_64+0x958d1f)
  #12 blk_new_open /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:375:10 (qemu-system-x86_64+0x98927a)
  #13 blockdev_init /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../blockdev.c:598:15 (qemu-system-x86_64+0x9c5700)
  #14 drive_new /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../blockdev.c:1092 (qemu-system-x86_64+0x9c5700)
  #15 drive_init(void*, QemuOpts*, Error**) /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/drive-share.cpp:366:28 (qemu-system-x86_64+0xb1e9a2)
  #16 qemu_opts_foreach /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/qemu-option.c:1106:14 (qemu-system-x86_64+0xb696e5)
  #17 android_drive_share_init /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/drive-share.cpp:530:9 (qemu-system-x86_64+0xb1e1e0)
  #18 main_impl /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:5244:13 (qemu-system-x86_64+0x54f811)
  #19 run_qemu_main /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:3349:21 (qemu-system-x86_64+0x547a02)
  #20 enter_qemu_main_loop(int, char**) /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/main.cpp:606:5 (qemu-system-x86_64+0x54427c)
  #21 MainLoopThread::run() /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android/android-emu/android/skin/qt/emulator-qt-window.h:73:13 (qemu-system-x86_64+0xc40b41)
  #22 QThreadPrivate::start(void*) /usr/local/google/home/joshuaduong/qt-build/src/qt-everywhere-src-5.11.1/qtbase/src/corelib/thread/qthread_unix.cpp:367:14 (libQt5Core.so.5+0xa7d35)

Thread T10 'MainLoopThread' (tid=3881, running) created by main thread at:
  #0 pthread_create /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:968:3 (qemu-system-x86_64+0x4d3cda)
  #1 QThread::start(QThread::Priority) /usr/local/google/home/joshuaduong/qt-build/src/qt-everywhere-src-5.11.1/qtbase/src/corelib/thread/qthread_unix.cpp:726:16 (libQt5Core.so.5+0xa84fb)
  #2 skin_winsys_spawn_thread /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android/android-emu/android/skin/qt/winsys-qt.cpp:519:17 (qemu-system-x86_64+0xbcf68b)
  #3 main /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/main.cpp:1624:5 (qemu-system-x86_64+0x543f9c)

Thread T13 (tid=0, running) created by thread T10 at:
  #0 on_new_fiber /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/coroutine-ucontext.c:90:25 (qemu-system-x86_64+0xb71b48)
  #1 qemu_coroutine_new /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/coroutine-ucontext.c:217 (qemu-system-x86_64+0xb71b48)
  #2 qemu_coroutine_create /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/qemu-coroutine.c:88:14 (qemu-system-x86_64+0xb70349)
  #3 blk_prw /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1245:25 (qemu-system-x86_64+0x98baca)
  #4 blk_pread /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1409:15 (qemu-system-x86_64+0x98b91a)
  #5 find_image_format /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:701:11 (qemu-system-x86_64+0x958068)
  #6 bdrv_open_inherit /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:2690 (qemu-system-x86_64+0x958068)
  #7 bdrv_open /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:2802:12 (qemu-system-x86_64+0x958d1f)
  #8 blk_new_open /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:375:10 (qemu-system-x86_64+0x98927a)
  #9 blockdev_init /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../blockdev.c:598:15 (qemu-system-x86_64+0x9c5700)
  #10 drive_new /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../blockdev.c:1092 (qemu-system-x86_64+0x9c5700)
  #11 drive_init(void*, QemuOpts*, Error**) /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/drive-share.cpp:366:28 (qemu-system-x86_64+0xb1e9a2)
  #12 qemu_opts_foreach /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/qemu-option.c:1106:14 (qemu-system-x86_64+0xb696e5)
  #13 android_drive_share_init /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/drive-share.cpp:530:9 (qemu-system-x86_64+0xb1e1e0)
  #14 main_impl /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:5244:13 (qemu-system-x86_64+0x54f811)
  #15 run_qemu_main /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:3349:21 (qemu-system-x86_64+0x547a02)
  #16 enter_qemu_main_loop(int, char**) /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/main.cpp:606:5 (qemu-system-x86_64+0x54427c)
  #17 MainLoopThread::run() /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android/android-emu/android/skin/qt/emulator-qt-window.h:73:13 (qemu-system-x86_64+0xc40b41)
  #18 QThreadPrivate::start(void*) /usr/local/google/home/joshuaduong/qt-build/src/qt-everywhere-src-5.11.1/qtbase/src/corelib/thread/qthread_unix.cpp:367:14 (libQt5Core.so.5+0xa7d35)

SUMMARY: ThreadSanitizer: data race /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/async.c:167:9 in qemu_bh_schedule

WARNING: ThreadSanitizer: data race (pid=3742)
Read of size 4 at 0x7b4400025758 by thread T14 (mutexes: write M1097):
  #0 aio_notify /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/async.c:342:14 (qemu-system-x86_64+0xb46778)
  #1 qemu_bh_schedule /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/async.c:168 (qemu-system-x86_64+0xb46778)
  #2 worker_thread /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/thread-pool.c:114:9 (qemu-system-x86_64+0xb4838c)
  #3 qemu_thread_trampoline /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/qemu-thread-posix.c:551:17 (qemu-system-x86_64+0xb4fe36)
Previous atomic write of size 4 at 0x7b4400025758 by thread T10 (mutexes: write M985):
  #0 __tsan_atomic32_fetch_add /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interface_atomic.cc:616:3 (qemu-system-x86_64+0x51df18)
  #1 aio_poll /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/aio-posix.c:608:9 (qemu-system-x86_64+0xb4c342)
  #2 blk_prw /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1247:9 (qemu-system-x86_64+0x98bbbd)
  #3 blk_pread /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1409:15 (qemu-system-x86_64+0x98b91a)
  #4 find_image_format /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:701:11 (qemu-system-x86_64+0x958068)
  #5 bdrv_open_inherit /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:2690 (qemu-system-x86_64+0x958068)
  #6 bdrv_open /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:2802:12 (qemu-system-x86_64+0x958d1f)
  #7 blk_new_open /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:375:10 (qemu-system-x86_64+0x98927a)
  #8 blockdev_init /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../blockdev.c:598:15 (qemu-system-x86_64+0x9c5700)
  #9 drive_new /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../blockdev.c:1092 (qemu-system-x86_64+0x9c5700)
  #10 drive_init(void*, QemuOpts*, Error**) /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/drive-share.cpp:366:28 (qemu-system-x86_64+0xb1e9a2)
  #11 qemu_opts_foreach /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/qemu-option.c:1106:14 (qemu-system-x86_64+0xb696e5)
  #12 android_drive_share_init /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/drive-share.cpp:530:9 (qemu-system-x86_64+0xb1e1e0)
  #13 main_impl /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:5244:13 (qemu-system-x86_64+0x54f811)
  #14 run_qemu_main /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:3349:21 (qemu-system-x86_64+0x547a02)
  #15 enter_qemu_main_loop(int, char**) /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/main.cpp:606:5 (qemu-system-x86_64+0x54427c)
  #16 MainLoopThread::run() /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android/android-emu/android/skin/qt/emulator-qt-window.h:73:13 (qemu-system-x86_64+0xc40b41)
  #17 QThreadPrivate::start(void*) /usr/local/google/home/joshuaduong/qt-build/src/qt-everywhere-src-5.11.1/qtbase/src/corelib/thread/qthread_unix.cpp:367:14 (libQt5Core.so.5+0xa7d35)

Location is heap block of size 296 at 0x7b44000256c0 allocated by thread T10:
  #0 calloc /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:684:5 (qemu-system-x86_64+0x4d26ef)
  #1 g_malloc0 /tmp/jansene-build-temp-193494/src/glib-2.38.2/glib/gmem.c:134 (qemu-system-x86_64+0xecfe18)
  #2 qemu_init_main_loop /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/main-loop.c:165:24 (qemu-system-x86_64+0xb4abac)
  #3 main_impl /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:4605:9 (qemu-system-x86_64+0x54b937)
  #4 run_qemu_main /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:3349:21 (qemu-system-x86_64+0x547a02)
  #5 enter_qemu_main_loop(int, char**) /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/main.cpp:606:5 (qemu-system-x86_64+0x54427c)
  #6 MainLoopThread::run() /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android/android-emu/android/skin/qt/emulator-qt-window.h:73:13 (qemu-system-x86_64+0xc40b41)
  #7 QThreadPrivate::start(void*) /usr/local/google/home/joshuaduong/qt-build/src/qt-everywhere-src-5.11.1/qtbase/src/corelib/thread/qthread_unix.cpp:367:14 (libQt5Core.so.5+0xa7d35)

Mutex M1097 (0x7b3800009bd0) created at:
  #0 pthread_mutex_init /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:1187:3 (qemu-system-x86_64+0x4d4ebc)
  #1 qemu_mutex_init /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/qemu-thread-posix.c:61:11 (qemu-system-x86_64+0xb4f2a7)
  #2 thread_pool_init_one /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/thread-pool.c:308:5 (qemu-system-x86_64+0xb47a7b)
  #3 thread_pool_new /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/thread-pool.c:327 (qemu-system-x86_64+0xb47a7b)
  #4 aio_get_thread_pool /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/async.c:320:28 (qemu-system-x86_64+0xb46934)
  #5 paio_submit_co /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/file-posix.c:1565:12 (qemu-system-x86_64+0xa15d23)
  #6 raw_co_prw /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/file-posix.c:1620 (qemu-system-x86_64+0xa15d23)
  #7 raw_co_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/file-posix.c:1627 (qemu-system-x86_64+0xa15d23)
  #8 bdrv_driver_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/io.c:924:16 (qemu-system-x86_64+0x8d91f7)
  #9 bdrv_aligned_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/io.c:1228 (qemu-system-x86_64+0x8d91f7)
  #10 bdrv_co_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/io.c:1324:11 (qemu-system-x86_64+0x8d8dd7)
  #11 blk_co_preadv /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1158:11 (qemu-system-x86_64+0x98b0da)
  #12 blk_read_entry /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1206:17 (qemu-system-x86_64+0x98c1e3)
  #13 coroutine_trampoline /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/coroutine-ucontext.c:173:9 (qemu-system-x86_64+0xb71c8a)
  #14 <null> <null> (libc.so.6+0x43fcf)

Mutex M985 (0x000003bb0fb8) created at:
  #0 pthread_mutex_init /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:1187:3 (qemu-system-x86_64+0x4d4ebc)
  #1 qemu_mutex_init /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/qemu-thread-posix.c:61:11 (qemu-system-x86_64+0xb4f2a7)
  #2 qemu_init_cpu_loop /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../cpus.c:1123:5 (qemu-system-x86_64+0x5ddb0c)
  #3 main_impl /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:3437:5 (qemu-system-x86_64+0x547ae0)
  #4 run_qemu_main /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:3349:21 (qemu-system-x86_64+0x547a02)
  #5 enter_qemu_main_loop(int, char**) /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/main.cpp:606:5 (qemu-system-x86_64+0x54427c)
  #6 MainLoopThread::run() /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android/android-emu/android/skin/qt/emulator-qt-window.h:73:13 (qemu-system-x86_64+0xc40b41)
  #7 QThreadPrivate::start(void*) /usr/local/google/home/joshuaduong/qt-build/src/qt-everywhere-src-5.11.1/qtbase/src/corelib/thread/qthread_unix.cpp:367:14 (libQt5Core.so.5+0xa7d35)

Thread T14 (tid=3886, running) created by thread T10 at:
  #0 pthread_create /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:968:3 (qemu-system-x86_64+0x4d3cda)
  #1 qemu_thread_create /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/qemu-thread-posix.c:591:11 (qemu-system-x86_64+0xb4fcef)
  #2 do_spawn_thread /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/thread-pool.c:135:5 (qemu-system-x86_64+0xb48072)
  #3 spawn_thread_bh_fn /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/thread-pool.c:143 (qemu-system-x86_64+0xb48072)
  #4 aio_bh_call /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/async.c:90:5 (qemu-system-x86_64+0xb46616)
  #5 aio_bh_poll /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/async.c:118 (qemu-system-x86_64+0xb46616)
  #6 aio_poll /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/aio-posix.c:706:17 (qemu-system-x86_64+0xb4cf45)
  #7 blk_prw /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1247:9 (qemu-system-x86_64+0x98bbbd)
  #8 blk_pread /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:1409:15 (qemu-system-x86_64+0x98b91a)
  #9 find_image_format /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:701:11 (qemu-system-x86_64+0x958068)
  #10 bdrv_open_inherit /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:2690 (qemu-system-x86_64+0x958068)
  #11 bdrv_open /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block.c:2802:12 (qemu-system-x86_64+0x958d1f)
  #12 blk_new_open /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../block/block-backend.c:375:10 (qemu-system-x86_64+0x98927a)
  #13 blockdev_init /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../blockdev.c:598:15 (qemu-system-x86_64+0x9c5700)
  #14 drive_new /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../blockdev.c:1092 (qemu-system-x86_64+0x9c5700)
  #15 drive_init(void*, QemuOpts*, Error**) /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/drive-share.cpp:366:28 (qemu-system-x86_64+0xb1e9a2)
  #16 qemu_opts_foreach /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/qemu-option.c:1106:14 (qemu-system-x86_64+0xb696e5)
  #17 android_drive_share_init /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/drive-share.cpp:530:9 (qemu-system-x86_64+0xb1e1e0)
  #18 main_impl /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:5244:13 (qemu-system-x86_64+0x54f811)
  #19 run_qemu_main /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../vl.c:3349:21 (qemu-system-x86_64+0x547a02)
  #20 enter_qemu_main_loop(int, char**) /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/main.cpp:606:5 (qemu-system-x86_64+0x54427c)
  #21 MainLoopThread::run() /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android/android-emu/android/skin/qt/emulator-qt-window.h:73:13 (qemu-system-x86_64+0xc40b41)
  #22 QThreadPrivate::start(void*) /usr/local/google/home/joshuaduong/qt-build/src/qt-everywhere-src-5.11.1/qtbase/src/corelib/thread/qthread_unix.cpp:367:14 (libQt5Core.so.5+0xa7d35)

Thread T10 'MainLoopThread' (tid=3881, running) created by main thread at:
  #0 pthread_create /usr/local/google/home/lfy/aosp-llvm-toolchain/toolchain/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:968:3 (qemu-system-x86_64+0x4d3cda)
  #1 QThread::start(QThread::Priority) /usr/local/google/home/joshuaduong/qt-build/src/qt-everywhere-src-5.11.1/qtbase/src/corelib/thread/qthread_unix.cpp:726:16 (libQt5Core.so.5+0xa84fb)
  #2 skin_winsys_spawn_thread /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android/android-emu/android/skin/qt/winsys-qt.cpp:519:17 (qemu-system-x86_64+0xbcf68b)
  #3 main /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../android-qemu2-glue/main.cpp:1624:5 (qemu-system-x86_64+0x543f9c)

SUMMARY: ThreadSanitizer: data race /usr/local/google/home/lfy/emu2/master/external/qemu/objs/../util/async.c:342:14 in aio_notify

yuri updated this revision to Diff 179049.Dec 20 2018, 5:41 AM
  • introduce ThreadType enum
  • cleanup internal interfaces
  • add pthread_exit() interceptor
  • more tests
yuri added a comment.EditedDec 20 2018, 7:52 AM

I don't completely follow the logic behind cur_thread/cur_thread_fast/cur_thread1 and how it does not introduce a slowdown.
cur_thread_fast still includes an indirection, so it looks exactly the same as the first version.
cur_thread_fast does not check for NULL, so I would expect it to be used only in a few carefully chosen functions. But it's used in most entry functions. Which makes me wonder: when can cur_thread_fast not be used, if it can be used in all these places?
It seems that you use cur_thread only to lazily initialize cur_thread1 in Initialize or ThreadStart. But then the question is: why don't we explicitly initialize cur_thread1 in these functions, and then just have a single cur_thread function as before, with no 2 subtly different accessors?

I assume that after a program or a new thread starts, the sanitizer will first see a few (or maybe zero) interceptors, then __tsan_func_entry(), and then everything else. Following this logic, I use cur_thread_fast() in all performance-critical entry points except __tsan_func_entry().

I ran the tsan benchmarks, and it looks like the memory indirection by itself does not affect performance because the variable is in the CPU cache in most cases. At the same time, the conditional check has a visible effect on performance.

There was a discussion of benchmark results here. I removed it because it looks like the times randomly fluctuate by 3-4%, which is more than the impact of the patch.

yuri marked 13 inline comments as done.Dec 20 2018, 8:05 AM
yuri added inline comments.
lib/tsan/rtl/tsan_rtl.h
880

We can't use an enum type for the parameter because it is a combination of bits. This is done the same way as enum MutexFlags in tsan_sync.h.

yuri marked an inline comment as done.Dec 20 2018, 8:05 AM
yuri added a comment.Dec 20 2018, 8:15 AM

I added default synchronization and flag to opt-out.

Can you switch to it in your codebase? I would expect that you need all (or almost all) of this synchronization anyway.

Unfortunately, I can't check it. The current implementation of our codebase works fine with synchronization on fiber switch.

I plan to implement another mode where fibers are not synchronized by default, but only when they call special synchronization APIs (events, mutexes, etc.). That way I can catch more errors in code where fibers run in the same thread only by chance. In order to implement it, I need some way to opt out of synchronization in tsan.
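
(For reference, the interface that was eventually committed exposes this opt-out as a flag to __tsan_switch_to_fiber(); a small usage sketch, assuming the __tsan_switch_to_fiber_no_sync constant from sanitizer/tsan_interface.h:)

#include <sanitizer/tsan_interface.h>

/* Switch to `fiber` without establishing a happens-before edge between the
   current fiber and the target, so data shared by the two fibers must be
   protected by explicit synchronization that TSan can then check. */
static void switch_fiber_unsynchronized(void *fiber) {
  __tsan_switch_to_fiber(fiber, __tsan_switch_to_fiber_no_sync);
  /* ... perform the actual context switch here ... */
}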

yuri updated this revision to Diff 179085.Dec 20 2018, 8:26 AM

Fix typo

I added default synchronization and flag to opt-out.

Can you switch to it in your codebase? I would expect that you need all (or almost all) of this synchronization anyway.

Unfortunately, I can't check it. The current implementation of our codebase works fine with synchronization on fiber switch.

I plan to implement another mode where fibers are not synchronized by default, but only when they call special synchronization APIs (events, mutexes, etc.). That way I can catch more errors in code where fibers run in the same thread only by chance. In order to implement it, I need some way to opt out of synchronization in tsan.

Sorry, I meant to switch to the built-in synchronization to validate that it works in real projects and either removes the need for manual synchronization annotations or at least significantly reduces it.

lib/tsan/rtl/tsan_rtl.h
880

Then I guess I am out of arguments because I wrote that code :)

Re performance, we have this open performance regression in clang codegen which makes it harder to do analysis:
https://bugs.llvm.org/show_bug.cgi?id=39748
https://reviews.llvm.org/D54821

I've benchmarked on 350140 with host gcc version 7.3.0 (Debian 7.3.0-5), running old/new binary alternated:

int main() {
  const int kSize = 2<<10;
  const int kRepeat = 1<<19;
  volatile long data[kSize];
  for (int i = 0; i < kRepeat; i++) {
    for (int j = 0; j < kSize; j++)
      data[j] = 1;
    __atomic_load_n(&data[0], __ATOMIC_ACQUIRE);
    __atomic_store_n(&data[0], 1, __ATOMIC_RELEASE);
 }
}

compiler-rt$ TIME="%e" nice -20 time taskset -c 0 ./current.test
8.68
8.71
8.63
8.67
8.73
8.63

compiler-rt$ TIME="%e" nice -20 time taskset -c 0 ./fiber.test
9.97
9.85
9.99
9.91
9.88
9.92
9.89
9.88

This looks like 14% degradation.

yuri updated this revision to Diff 180547.Jan 7 2019, 12:46 PM

Improved performance

yuri added a comment.Jan 7 2019, 12:58 PM

I improved test execution time. On my system I got the following execution times (compared to the original version of the code):

compiler    current    fibers
gcc 8.2     100.0%     98.6%
clang       112.4%     100.7%

I improved test execution time. On my system I got the following execution times (compared to the original version of the code):

compiler    current    fibers
gcc 8.2     100.0%     98.6%
clang       112.4%     100.7%

"current" is compiler-rt HEAD with no local changes, or with the previous version of this change? This makes clang-compiled runtime 12% faster?
Were you able to reproduce the slowdown in https://reviews.llvm.org/D54889#1343582 with the previous version of the fibers change? What slowdown did you get?

What exactly has changed wrt performance? I see some small tweaks, but I am not sure if they are the main reason behind the speedup or I am missing something important.

yuri added a comment.Jan 8 2019, 2:23 AM

"current" is compiler-rt HEAD with no local changes, or with the previous version of this change? This makes clang-compiled runtime 12% faster?

"current" is compiler-rt HEAD without any changes.
"clang" is LLVM and clang HEAD.

In the case of the clang-compiled library, even the previous version of the patch was faster than HEAD.

Looks like the additional indirection of ThreadState by itself introduces minimal overhead, but it affects code generation in unpredictable ways, especially for clang.

Were you able to reproduce the slowdown in https://reviews.llvm.org/D54889#1343582 with the previous version of the fibers change? What slowdown did you get?

Yes, with the previous version of the patch and a gcc-compiled library I see a slowdown of around 3.5%.

What exactly has changed wrt performance? I see some small tweaks, but I am not sure if they are the main reason behind the speedup or I am missing something important.

You may consider this "cheating" because the changes are general and not related to fibers.

Most of the speedup is because of this change in tsan_rtl.cc:

- if (!SANITIZER_GO && *shadow_mem == kShadowRodata) {
+ if (!SANITIZER_GO && !kAccessIsWrite && *shadow_mem == kShadowRodata) {

Placing LIKELY/UNLIKELY hints in the code gives an additional 1-2%.
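
For context, LIKELY/UNLIKELY in the sanitizer runtimes are the usual __builtin_expect wrappers; a small illustrative example of the kind of hint being placed (the exact call sites are in the patch itself):

#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

// Illustrative only: keep the common case on the fall-through path so the
// hot loop stays friendly to the branch predictor and code layout.
int check(int rare_condition) {
  if (UNLIKELY(rare_condition))
    return -1;  // rare slow path
  return 0;     // common fast path
}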

Benchmark results with clang:

write8:
HEAD: 9.74
fibers(old): 9.40
fibers(new): 8.66

read8:
HEAD: 9.92
fibers(new): 9.26

read4:
HEAD: 10.20
fibers(new): 9.36

fibers(new) is the current version of the change. fibers(old) is the one I used with gcc (no spot optimizations).

So this change indeed makes it faster for clang, but the fastest clang version is still slower than gcc, which suggests that there is something to improve in clang codegen. If we improve clang codegen, will this change again lead to slower code or not? It's bad that we have that open clang regression...

yuri updated this revision to Diff 181253.Jan 11 2019, 5:19 AM

Optimized MemoryAccessImpl1()

yuri added a comment.Jan 11 2019, 5:23 AM

I have found out which optimization is applied by gcc but not by clang and did it manually. Execution times for the new version:

compiler    HEAD       fibers
gcc 8.2     100.0%     94.4%
clang       112.4%     94.1%
yuri added a comment.Jan 21 2019, 8:01 AM

Hi Dmitry,
Now, when all performance issues are resolved, what else should be done to complete the review?

Now, when all performance issues are resolved, what else should be done to complete the review?

Just me finding time to review it.
I need to put together all the numbers. Is it faster than it was (that is, do the new optimizations outweigh the change to cur_thread)? Or does the change to cur_thread make it faster by itself? I am somewhat concerned by the "cheating": I understand that it's you who found these potential additional optimizations, but they are still unrelated to the cur_thread change. Thousands of existing tsan users don't use fibers at all today.
Thanks for your persistence btw.

Spent a while benchmarking this. I used 4 variations of a test program:
First just does read or write access of the given size:
https://gist.githubusercontent.com/dvyukov/9155651ecb9434368f06a30df2abcb20/raw/c0756a9d643718faaacda2c78ace1cb95f05af87/gistfile1.txt
Second, prepopulates shadow with some other accesses first:
https://gist.githubusercontent.com/dvyukov/12b4ca0a15ffccccdaf9e92d20df40e2/raw/a2cdfc65fccf1eecdc232c876dc45e41229f6ce1/gistfile1.txt
Next has 2 threads that access the range in round-robin, both threads do writes (or reads):
https://gist.githubusercontent.com/dvyukov/0d84c9d9795cbbd8b9b9bb557d1e2f36/raw/24dcdf48f39cd12ac77ae71e9de7f8ee01c17df2/gistfile1.txt
The last one has 1 thread that does writes, followed by 2 threads that do reads:
https://gist.githubusercontent.com/dvyukov/1efa2c0335d9d944b33857dbd70cb8ce/raw/6eff5ea186fd5d8540090faaf344ad66250c6e5e/gistfile1.txt

I used 3 variations of the runtime: the current tsan runtime without modifications, this change, and this change with cur_thread/cur_thread_fast returning reinterpret_cast<ThreadState *>(cur_thread_placeholder). In all cases I used a self-hosted clang build at revision 353000.
Each test was run 3 times and the fastest time was taken.
The current code was the slowest in all cases, so I removed it from the results. Here is the slowdown in % for fiber support compared to returning cur_thread_placeholder:

Fibers seem to incur ~4% slowdown on average.

Herald added a project: Restricted Project. · View Herald TranscriptFeb 5 2019, 6:10 AM

Since fiber support incurs slowdown for all current tsan users who don't use fibers, this is a hard decision.

I've prototyped a change which leaves fast_state and fast_synch_epoch in TLS (the only state accessed on fast path besides the clock):

struct FastThreadState {
  FastState fast_state;
  u64 fast_synch_epoch;
};

__attribute__((tls_model("initial-exec")))
extern THREADLOCAL char cur_thread_faststate1[];

INLINE FastThreadState& cur_thread_faststate() {
  return *reinterpret_cast<FastThreadState *>(cur_thread_faststate1);
}

But this seems to be even slower than just using the single pointer indirection.

Also benchmarked function entry/exit using the following benchmark:

// foo1.c
void foo(bool x);
void bar() {}
int main() {
  volatile int kRepeat = 1 << 30;
  const int repeat = kRepeat;
  for (int i = 0; i < repeat; i++)
    foo(false);
}

// foo2.c
void bar();
void foo(bool x) {
  if (x)
    bar();
}

The program spends ~75% of its time in __tsan_func_entry/exit. The rest of the conditions are the same as in the previous benchmark.

Current code runs in 7.16s
This change -- 6.23s
This change + cur_thread_fast returning cur_thread_placeholder -- 7.01s
I also tried this change + FuncEntry using cur_thread_fast -- 6.20s

Using the pointer indirection seems to positively affect func entry/exit codegen.

Current code runs in 7.16s
This change -- 6.23s
This change + cur_thread_fast returning cur_thread_placeholder -- 7.01s
I also tried this change + FuncEntry using cur_thread_fast -- 6.20s

But if I do:

INLINE ThreadState *cur_thread_fast() {
  ThreadState* thr;
  __asm__("": "=a"(thr): "a"(&cur_thread_placeholder[0]));
  return thr;
}

(which is a dirty trick to force compiler to cache address of the tls object in a register) then the program runs 5.94s -- faster than any other options as it takes advantage of both no indirection and faster instructions.

But this is not beneficial for __tsan_read/write functions because caching the address takes a register and these functions are already severely short on registers.

Going forward I think we should get in all unrelated/preparatory changes first: thread type (creates lots of diffs), the pthread_exit interceptor/test, and the spot optimizations to the memory access functions.

yuri added a comment.EditedFeb 6 2019, 12:45 PM

Going forward I think we should get in all unrelated/preparatory changes first: thread type (creates lots of diffs), the pthread_exit interceptor/test, and the spot optimizations to the memory access functions.

Great to hear it!

I have uploaded first part (thread type) at D57839.

yuri added a comment.Feb 6 2019, 11:50 PM

One more patch (pthread_exit): D57876

yuri added a comment.Feb 7 2019, 3:17 AM

Next patch (performance optimizations): D57882

yuri updated this revision to Diff 185745.Feb 7 2019, 4:57 AM

Rebased on current trunk

yuri marked an inline comment as done.Feb 7 2019, 4:58 AM

Current code runs in 7.16s
This change -- 6.23s
This change + cur_thread_fast returning cur_thread_placeholder -- 7.01s
I also tried this change + FuncEntry using cur_thread_fast -- 6.20s

But if I do:

INLINE ThreadState *cur_thread_fast() {
  ThreadState* thr;
  __asm__("": "=a"(thr): "a"(&cur_thread_placeholder[0]));
  return thr;
}

(which is a dirty trick to force compiler to cache address of the tls object in a register) then the program runs 5.94s -- faster than any other options as it takes advantage of both no indirection and faster instructions.

This looked like an interesting optimization, but it turns out to be too sensitive to unrelated code changes.
With a slightly changed benchmark:
http://llvm.org/viewvc/llvm-project?view=revision&revision=353407
This change shows ~2% slowdown instead: from 6s to 6.15s.

Did another round of benchmarking of this change on the current HEAD using these 2 benchmarks:
https://github.com/llvm-mirror/compiler-rt/blob/master/lib/tsan/benchmarks/mop.cc
https://github.com/llvm-mirror/compiler-rt/blob/master/lib/tsan/benchmarks/func_entry_exit.cc

Here fibers is this change, and fibers* is this change with 2 additional changes:

  1. cur_thread_fast in __tsan_func_entry.
  2. cur_thread1 embedded in the ThreadState. The reason for this is that cur_thread1 and ThreadState are currently separated and in different cache lines (cur_thread is allocated with an ALIGN(64) attribute and all hot members are at the beginning of the struct). So what I did is:

    struct ThreadState {
      FastState fast_state;
      u64 fast_synch_epoch;
      ThreadState* current;  // <----- ADD THIS
      ...

    INLINE ThreadState *cur_thread_fast() {
      // return cur_thread1;
      return reinterpret_cast<ThreadState *>(cur_thread_placeholder)->current;
    }

Now ~2% slowdown on highly synthetic benchmarks looks like something we can tolerate (2 cases are actually faster).

The cur_thread/cur_thread_fast separation still looks confusing to me. It's a convoluted way to do lazy initialization. If one adds any new calls to these functions, which one to choose is non-obvious. I think we should do lazy initialization explicitly. Namely, leave cur_thread alone, don't introduce cur_thread_fast, and don't change any call sites. Instead, add an init_cur_thread call that does lazy initialization to the interceptor entry point and any other points that we expect can be the first call into the tsan runtime overall or within a new thread. I think interceptors and tsan_init should be enough (no tsan_func_entry). We call __tsan_init from .preinit_array, instrumented code can't be executed before .preinit_array, and only interceptors from the dynamic loader can precede .preinit_array callbacks.

With these 3 changes, it looks good to me and I am ready to merge it.

To clarify the graph: it's the difference in execution time in percent compared to the current HEAD. I.e., -4 means that fibers are 4% slower than the current HEAD.
1w means the mop.cc benchmark with 1-byte write accesses; 4r means 4-byte read accesses; ee is the func_entry_exit.cc benchmark.

yuri updated this revision to Diff 185949.Feb 8 2019, 5:23 AM
  • Replaced cur_thread1 with ThreadState::current
  • Removed cur_thread_fast() and replaced calls with cur_thread()
  • Added cur_thread_init()
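
A minimal sketch of how the pieces listed in this update might fit together (simplified and hypothetical in the details; the committed code in tsan_rtl.h differs):

// Stand-ins so the sketch is self-contained.
#define INLINE inline
#define THREADLOCAL __thread

struct ThreadState {
  // hot members (fast_state, fast_synch_epoch, ...) come first in the real struct
  ThreadState *current;  // ThreadState this OS thread is currently running as
};

static THREADLOCAL char cur_thread_placeholder[sizeof(ThreadState)]
    __attribute__((aligned(64)));

INLINE ThreadState *cur_thread() {
  // One extra load through `current`, which __tsan_switch_to_fiber() can
  // repoint at a fiber's ThreadState.
  return reinterpret_cast<ThreadState *>(cur_thread_placeholder)->current;
}

INLINE void cur_thread_init() {
  // Lazy initialization at interceptor entry / thread start: the first call
  // on a thread points `current` back at the thread's own static TLS slot.
  ThreadState *thr = reinterpret_cast<ThreadState *>(cur_thread_placeholder);
  if (thr->current == nullptr)
    thr->current = thr;
}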
yuri added a comment.EditedFeb 8 2019, 5:28 AM

For now I added calls to cur_thread_init() into 3 places. It was enough to pass all tests on my system. I am not sure if it will work with different versions of glibc. What do you think about it?

yuri updated this revision to Diff 186145.Feb 10 2019, 2:07 AM

Rebased on current trunk

The change is now in very good shape.

For now I added calls to cur_thread_init() into 3 places. It was enough to pass all tests on my system. I am not sure if it will work with different versions of glibc. What do you think about it?

I've tested on 2 more distributions and it worked for me.
However, for thread initialization we rely on the fact that pthread's thread entry function (start_thread) calls _setjmp, which we intercept. If start_thread does not call _setjmp, then we risk hitting a NULL cur_thread in secondary threads. I think we also need to add cur_thread_init to:

  1. SCOPED_INTERCEPTOR_RAW (this will ensure that if start_thread calls any intercepted libc function, we will init at that point)
  2. tsan_thread_start_func (an analog of tsan_init for secondary threads; this function is not instrumented, so even if start_thread does not call any intercepted functions, we can still initialize at that point without slowing down __tsan_func_entry)
yuri added a comment.Feb 11 2019, 4:10 AM

However, for thread initialization we rely on the fact that pthread's thread entry function (start_thread) calls _setjmp, which we intercept. If start_thread does not call _setjmp, then we risk hitting a NULL cur_thread in secondary threads. I think we also need to add cur_thread_init to:

  1. SCOPED_INTERCEPTOR_RAW (this will ensure that if start_thread calls any intercepted libc function, we will init at that point)

There are a lot of interceptors that do

if (cur_thread()->in_symbolizer)

before SCOPED_INTERCEPTOR_RAW. What to do with them?

However, for thread initialization we rely on the fact that pthread's thread entry function (start_thread) calls _setjmp, which we intercept. If start_thread does not call _setjmp, then we risk hitting a NULL cur_thread in secondary threads. I think we also need to add cur_thread_init to:

  1. SCOPED_INTERCEPTOR_RAW (this will ensure that if start_thread calls any intercepted libc function, we will init at that point)

There are a lot of interceptors that do

if (cur_thread()->in_symbolizer)

before SCOPED_INTERCEPTOR_RAW. What to do with them?

Yikes! Good question!

If we are in the symbolizer, we've already initialized cur_thread, since we are coming recursively from the runtime. But this does not help, because if we are not in the symbolizer, cur_thread can still be uninitialized...

We have it in malloc, atexit and similar fundamental functions that may well be the first functions called during process or thread start.

All of the in_symbolizer checks call cur_thread() in the same expression rather than using a local variable, i.e. they are of the form:

if (cur_thread()->in_symbolizer)

which suggests that we should introduce a helper in_symbolizer() function that encapsulates cur_thread_init and the check (it should probably go into tsan_interceptors.h).
Please send a separate change that adds such a helper function, and then this change will just add cur_thread_init to that function.
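
A sketch of what such a helper might look like (the exact version landed separately in D58104; the UNLIKELY hint and placement here are illustrative):

// tsan_interceptors.h (sketch; relies on the runtime's cur_thread_init(),
// cur_thread() and UNLIKELY() described elsewhere in this review)
inline bool in_symbolizer() {
  cur_thread_init();  // make cur_thread usable even on the very first
                      // interceptor call executed by a thread
  return UNLIKELY(cur_thread()->in_symbolizer);
}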

yuri added a comment.Feb 11 2019, 11:37 PM

Patch with in_symbolizer() function: D58104

yuri updated this revision to Diff 186423.Feb 12 2019, 2:00 AM

Add more calls of cur_thread_init()

dvyukov accepted this revision.Feb 12 2019, 2:12 AM
This revision is now accepted and ready to land.Feb 12 2019, 2:12 AM
dvyukov closed this revision.Feb 12 2019, 2:14 AM

Committed in:
http://llvm.org/viewvc/llvm-project?view=revision&revision=353817

Thanks for bearing with me; this touched very sensitive parts of the runtime, so I did not want to rush. This adds a useful capability to ThreadSanitizer. The performance improvements that resulted from this work are much appreciated too.

yuri added a comment.Feb 12 2019, 2:33 AM

Final patch (macOS support): D58110

yuri added a subscriber: lei.Feb 12 2019, 5:07 AM
yuri added a comment.Feb 12 2019, 5:10 AM

@lei, please help investigate what happened in http://lab.llvm.org:8011/builders/sanitizer-ppc64be-linux/builds/11413

Is it possible to get a stack trace of the crash?

rovka added a subscriber: rovka.Feb 13 2019, 3:21 AM

Hi Yuri,

I think this might be breaking our aarch64-full buildbot [1]. I ran compiler-rt/test/sanitizer_common/tsan-aarch64-Linux/Linux/Output/clock_gettime.c.tmp and get this stack trace [2]:

Any ideas?

The bot has been red for quite a while now, so I think I will have to revert this while we investigate.

[1] http://lab.llvm.org:8011/builders/clang-cmake-aarch64-full/builds/6556
[2] https://pastebin.com/1cr5hMD2

yuri added a comment.Feb 13 2019, 3:46 AM

Hi Yuri,

I think this might be breaking our aarch64-full buildbot [1]. I ran compiler-rt/test/sanitizer_common/tsan-aarch64-Linux/Linux/Output/clock_gettime.c.tmp and get this stack trace [2]:

Any ideas?

Please see fix in D58171

Resubmitted in 353947 with 2 fixes squashed.

  • Forwarded message ---------

From: Yi-Hong Lyu via Phabricator <reviews@reviews.llvm.org>
Date: Wed, Feb 13, 2019 at 5:33 PM
Subject: [Diffusion] rL353817: tsan: add fiber support

Yi-Hong.Lyu added a comment.
This patch seems to cause a failure on buildbot http://lab.llvm.org:8011/builders/clang-ppc64le-linux-multistage/builds/9150

I see a similar failure was already reported:
https://reviews.llvm.org/D54889#1394632
There was no fix for it, right? Or did I forget to squash something else?

yuri added a comment.Feb 13 2019, 8:47 AM

Looks like it is already fixed. See http://lab.llvm.org:8011/builders/clang-ppc64le-linux-multistage/builds/9157

yuri added a comment.Feb 13 2019, 8:51 AM

It is also fixed: http://lab.llvm.org:8011/builders/sanitizer-ppc64be-linux/builds/11427

I am working on integrating this for OpenMP tasks in openmp/tools/archer/ompt-tsan.cpp.
I just submitted a patch for review to make the interface dynamically linkable: https://reviews.llvm.org/D74487

Initial experiments with the integration into ompt-tsan show two issues:

  • switching overhead for small tasks is significant (70x over the execution without switching fibers)
  • for large numbers of fibers (~15k) I get allocation errors, although top doesn't show significant memory usage by the application:
==274531==FATAL: ThreadSanitizer: internal allocator is out of memory trying to allocate 0x3fb58 bytes

or

==274830==ERROR: ThreadSanitizer failed to deallocate 0x41000 (266240) bytes at address 0x7f8813fd2000
==274830==ERROR: ThreadSanitizer failed to deallocate 0x43000 (274432) bytes at address 0x0e2050cf1000
FATAL: ThreadSanitizer CHECK failed: /home/pj416018/TSAN/llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_posix.cc:61 "(("unable to unmap" && 0)) != (0)" (0x0, 0x0)