This is an archive of the discontinued LLVM Phabricator instance.

[NativeProcessLinux] (Partially) fix race condition during inferior thread creation
ClosedPublic

Authored by labath on Apr 21 2015, 4:41 AM.

Details

Summary

The following situation occurred when we were stopping a process (due to a breakpoint or
watchpoint hit, etc.) while a new thread was being created.

  • process has two threads: A and B.
  • thread A hits a breakpoint: we send a STOP signal to thread B and register a callback with ThreadStateCoordinator to send a stop notification after the thread stops.
  • thread B stops, but not due to the SIGSTOP: it stops on a thread creation event (for a new thread C). The handler is unaware of the pending stop request, so it queues ThreadStopped and RequestResume operations with the TSC so the thread can continue running.
  • TSC receives the ThreadStopped event, sees that all threads are stopped and fires the delayed stop notification.
  • immediately after that TSC gets the RequestResume operation, so it resumes the thread.

At this point the state is inconsistent, because LLDB thinks the process is stopped and will start
issuing commands to it, while one of the threads is in fact running. Things eventually break.
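
As a minimal, self-contained model of this interleaving (the type and method names here are
hypothetical simplifications; the real ThreadStateCoordinator API is more involved), consider a
serialized event queue where the deferred stop report fires between the two queued events:

  // Minimal model of the race: the deferred stop report fires between the
  // queued ThreadStopped and RequestResume events, leaving thread 2 running
  // after LLDB was already told the process is stopped.
  #include <cstdio>
  #include <functional>
  #include <map>
  #include <queue>

  enum class ThreadState { Running, Stopped };

  struct MiniTSC {
    std::map<int, ThreadState> threads;        // tid -> state
    bool deferred_stop_pending = false;        // a stop report is pending
    std::queue<std::function<void()>> events;  // serialized event queue

    bool AllStopped() const {
      for (const auto &t : threads)
        if (t.second != ThreadState::Stopped)
          return false;
      return true;
    }

    void NotifyThreadStopped(int tid) {
      events.push([this, tid] {
        threads[tid] = ThreadState::Stopped;
        if (deferred_stop_pending && AllStopped()) {
          deferred_stop_pending = false;
          std::printf("-> process stop reported to LLDB\n");
        }
      });
    }

    void RequestResume(int tid) {
      events.push([this, tid] {
        threads[tid] = ThreadState::Running;  // runs *after* the stop report
        std::printf("-> thread %d resumed\n", tid);
      });
    }

    void Drain() {
      while (!events.empty()) {
        events.front()();
        events.pop();
      }
    }
  };

  int main() {
    MiniTSC tsc;
    tsc.threads = {{1, ThreadState::Stopped},   // thread A: hit a breakpoint
                   {2, ThreadState::Running}};  // thread B: SIGSTOP in flight
    tsc.deferred_stop_pending = true;

    // Thread B stops on the clone event (not the SIGSTOP) and, unaware of
    // the pending stop request, queues both events:
    tsc.NotifyThreadStopped(2);
    tsc.RequestResume(2);
    tsc.Drain();  // the stop is reported first; then thread 2 runs again
  }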

I address this problem by omitting the two TSC events altogether and resuming thread B
directly. This way the short stop is invisible to the TSC, and the delayed notification will not
fire. We will fire the notification when we actually process the SIGSTOP on thread B.
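
A rough sketch of the shape of the fix, using a raw ptrace call for illustration (the handler
name is hypothetical, and the actual patch goes through NativeProcessLinux's resume machinery
rather than calling ptrace directly here):

  #include <cstdio>
  #include <sys/ptrace.h>
  #include <sys/types.h>

  // Hypothetical handler for the parent thread's clone event.
  static void HandleCloneEventOnParent(pid_t parent_tid) {
    // Before the patch, this queued two coordinator events:
    //   coordinator->NotifyThreadStopped(parent_tid);
    //   coordinator->RequestResume(parent_tid);
    // Now the short stop stays invisible to the coordinator: the thread is
    // resumed in place, so no deferred stop notification can fire on it.
    if (ptrace(PTRACE_CONT, parent_tid, nullptr, nullptr) == -1)
      std::perror("ptrace(PTRACE_CONT)");
  }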

This patch also removes the synchronisation between the thread creation notifications on threads
B and C. The need for this synchronisation is unclear (the comments seem to hint that the new
thread is "fully created" only after we process both events, but I have noticed no regressions
from treating it as "created" after processing just the initial C event), and it is a source of
many kinds of obscure races, since it introduces a new thread state, "Launching", which the rest
of the code does not handle at all (what happens if we get a resume request from LLDB while this
thread is launching? What happens if we get a stop request? etc.).

This fixes the "spurious $O packet" problem in TestPrintStackTraces.py. However, I cannot enable
the test yet, as a different issue (in stack unwinding) still sporadically crops up.

Note that there is still one issue remaining in the situation described above: once we get the
"new thread notification" for thread C, we will immediately resume it, and we will end up with a
running thread anyway. This issue needs to be handled differently (since thread C was never
sent a SIGSTOP in the first place), and I will address it in a follow-up patch.

Diff Detail

Repository
rL LLVM

Event Timeline

labath updated this revision to Diff 24114.Apr 21 2015, 4:41 AM
labath retitled this revision from to [NativeProcessLinux] (Partially) fix race condition during inferior thread creation.
labath updated this object.
labath edited the test plan for this revision. (Show Details)
labath added reviewers: vharron, chaoren.
labath added a subscriber: Unknown Object (MLST).
chaoren edited edge metadata.Apr 21 2015, 10:52 AM

Please see inline comments:

source/Plugins/Process/Linux/NativeProcessLinux.cpp
2289 ↗(On Diff #24114)

It seems that this and the SI_USER signal could happen in any order. Are you sure it's safe to resume the new thread if this signal hasn't been received yet? Is it possible that you've just been lucky and have always been getting this before the SI_USER signal?

2606 ↗(On Diff #24114)

Please remove this comment. It doesn't make sense anymore.

2610 ↗(On Diff #24114)

I think if we resumed the thread directly and used m_coordinator_up->NotifyThreadCreate (not stopped), we could bypass this issue, since TSC::ThreadWasCreated has a check for pending thread stop notifications.
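
For readers following the thread, a rough, hypothetical simplification of the check being
referenced (the real TSC::ThreadWasCreated differs in detail):

  #include <map>

  enum class ThreadState { Running, Stopped };

  struct MiniCoordinator {
    std::map<int, ThreadState> m_thread_states;
    bool m_pending_stop_notification = false;

    void RequestThreadStop(int tid) {
      (void)tid;  // in reality: send SIGSTOP to the thread (e.g. via tgkill)
    }

    void ThreadWasCreated(int tid, bool is_stopped) {
      m_thread_states[tid] =
          is_stopped ? ThreadState::Stopped : ThreadState::Running;
      // If a deferred stop report is pending, pull the new thread into the
      // stop instead of letting it run past the reported stop.
      if (m_pending_stop_notification && !is_stopped)
        RequestThreadStop(tid);
    }
  };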

2645 ↗(On Diff #24114)

This function is only used here and in the other place you deleted. Should we get rid of it?

2663 ↗(On Diff #24114)

Let's hope Todd is wrong.

2665 ↗(On Diff #24114)

We might as well remove the launching state entirely then?

labath updated this revision to Diff 24206.Apr 22 2015, 4:00 AM
labath edited edge metadata.

Update based on review comments.

I've updated the code based on your suggestions. Please take another look.

source/Plugins/Process/Linux/NativeProcessLinux.cpp
2289 ↗(On Diff #24114)

Yes, the two events do indeed come in random order; I have verified this. It follows from the fact that they happen on two threads, and the order depends on the order in which waitpid visits the threads when looking for events. However, I postulate that we do not need to synchronize here to be correct. I have gone through the manual page backwards and forwards and found nothing to indicate that this would be necessary. Furthermore, I can't think of a valid reason why the kernel would choose to stop the two threads at a point where they already exist as separate entities but are still linked in a way that makes valid operations on one depend on the state of the other. It would be much simpler to stop them a few (kernel) instructions further on, where the new thread's state is initialised and decoupled from the parent thread.
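
To illustrate why the order is arbitrary: both stops are reported through the same
waitpid(-1, ...) call, which returns whichever stopped tracee it finds first. A simplified
monitor step, assuming the tracees were attached with the PTRACE_O_TRACECLONE option (this is
illustrative only, not the actual NativeProcessLinux code):

  #include <csignal>
  #include <cstdio>
  #include <sys/ptrace.h>
  #include <sys/types.h>
  #include <sys/wait.h>

  void MonitorOnce() {
    int status = 0;
    // waitpid(-1, ...) reports the parent's PTRACE_EVENT_CLONE stop and the
    // new thread's initial stop in whichever order it encounters them.
    pid_t tid = waitpid(-1, &status, __WALL);
    if (tid <= 0 || !WIFSTOPPED(status))
      return;

    if ((status >> 8) == (SIGTRAP | (PTRACE_EVENT_CLONE << 8))) {
      unsigned long new_tid = 0;  // tid of the freshly created thread
      ptrace(PTRACE_GETEVENTMSG, tid, nullptr, &new_tid);
      std::printf("thread %d cloned new thread %lu\n", tid, new_tid);
    } else if (WSTOPSIG(status) == SIGSTOP) {
      // Either the new thread's initial stop, or a SIGSTOP we sent
      // ourselves; si_code (SI_USER/SI_TKILL) identifies the latter.
      siginfo_t info;
      if (ptrace(PTRACE_GETSIGINFO, tid, nullptr, &info) == 0)
        std::printf("thread %d stopped, si_code=%d\n", tid, info.si_code);
    }
  }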

2606 ↗(On Diff #24114)

done.

2610 ↗(On Diff #24114)

That is an excellent idea. Done as you suggested.

2645 ↗(On Diff #24114)

Good idea. Removed.

2663 ↗(On Diff #24114)

I've been running the test suite for a while and I see no ill effects from deleting this.

2665 ↗(On Diff #24114)

Good point. Done.

chaoren accepted this revision.Apr 22 2015, 8:43 AM
chaoren edited edge metadata.

LGTM

source/Plugins/Process/Linux/NativeThreadLinux.cpp
305 ↗(On Diff #24206)

I wonder if we could get rid of StateType::eStateLaunching entirely. We'd have fewer cases to handle.

test/python_api/lldbutil/process/TestPrintStackTraces.py
24 ↗(On Diff #24206)

This probably solves a bunch of other flaky failures too (e.g., TestSetWatchlocation, TestTargetWatchAddress).

This revision is now accepted and ready to land.Apr 22 2015, 8:43 AM
vharron added inline comments.Apr 22 2015, 3:20 PM
source/Plugins/Process/Linux/NativeProcessLinux.cpp
2264 ↗(On Diff #24206)

points->point

2281 ↗(On Diff #24206)

I was thinking about adding a variable somewhere that tracks our current goal. Do we want all of the threads to be running or stopped?

If the goal is running, we resume. If not, we shouldn't resume, right?

I think we can flip this bit at the same time we send the stop request to all threads.

We can flip it back when we get a continue from the host.
(You don't need to address this in this patch, just sharing an idea I had)

vharron accepted this revision.Apr 22 2015, 3:21 PM
vharron edited edge metadata.
vharron added inline comments.
test/python_api/lldbutil/process/TestPrintStackTraces.py
21 ↗(On Diff #24206)

I would leave this as @expectedFailureLinux unless you've run it 100x successfully.

labath added inline comments.Apr 23 2015, 1:49 AM
source/Plugins/Process/Linux/NativeProcessLinux.cpp
2264 ↗(On Diff #24206)

done.

2281 ↗(On Diff #24206)

I was considering this as well, but I believe this flag will need to be per thread (and perhaps a default flag for threads that are created in the meantime), since we sometimes get partial resume requests from the client (e.g., resume threads A and B but keep everything else stopped). Knowledge about this is scattered around the TSC and NPL, but I would rather not touch it now, since I don't know what the other threads are doing. Once we get rid of the TSC thread, I will reconsider this.
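
A sketch of what the per-thread variant of this idea could look like (hypothetical types; these
are not actual NativeProcessLinux or TSC members):

  #include <map>

  enum class GoalState { Running, Stopped };

  struct ThreadGoals {
    std::map<int, GoalState> goals;                // tid -> desired state
    GoalState default_goal = GoalState::Running;   // for threads created later

    void OnStopRequested(int tid) { goals[tid] = GoalState::Stopped; }
    void OnResumeRequested(int tid) { goals[tid] = GoalState::Running; }

    // Called when a thread stops for an internal reason (e.g. a clone
    // event): resume only if the host last asked this thread to run.
    bool ShouldResume(int tid) const {
      auto it = goals.find(tid);
      return (it == goals.end() ? default_goal : it->second) ==
             GoalState::Running;
    }
  };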

source/Plugins/Process/Linux/NativeThreadLinux.cpp
305 ↗(On Diff #24206)

What do you mean by that?
eStateLaunching comes from a global lldb enum, which is used in a bunch of places, so we cannot just nix this field. We could possibly introduce our own enum (or just a "bool running" field), but I'm not sure that's a good idea.

test/python_api/lldbutil/process/TestPrintStackTraces.py
21 ↗(On Diff #24206)

I've run it a bunch of times (around 20-30) and it was OK. I'll run it some more and see what happens. I believe the best way to verify it's fixed is to enable it and follow the build bot. It's easy to XFAIL it again if it still turns out to be broken.

24 ↗(On Diff #24206)

Right. I'll run the tests and enable them if they pass.

labath added inline comments.Apr 23 2015, 1:54 AM
test/python_api/lldbutil/process/TestPrintStackTraces.py
21 ↗(On Diff #24206)

On second thought, there is still the issue Chaoren pointed out yesterday, which can theoretically break things. I will submit a CL for this and enable the tests then.

This revision was automatically updated to reflect the committed changes.