Page MenuHomePhabricator

LLDB: Fixed race condition on timeout when stopping private state thread
ClosedPublic

Authored by cameron314 on Apr 14 2016, 9:32 AM.

Details

Summary

When stopping the private state thread, there was a race condition between the time the thread exits (resetting the HostThread object) and the time a Join was attempted, especially in the case of a timeout.

The previous workaround of copying the HostThread object is not enough, since on a Reset the internal thread stuff gets nulled out regardless of which HostThread object actually has Reset called on it, resulting in an attempt to dereference a null pointer on the subsequent call to Join from the copy as well.

Additionally, there was a race between the detach (called when stopping the process) and the stop itself, causing the stop to time out because it was waiting for the private state thread to see the stop state, but it had exited immediately after entering the detached state.

In our environment (custom target), we were hitting these races every single time we stopped debugging.

Diff Detail

Repository
rL LLVM

Event Timeline

cameron314 updated this revision to Diff 53729.Apr 14 2016, 9:32 AM
cameron314 retitled this revision from to LLDB: Fixed race condition on timeout when stopping private state thread.
cameron314 updated this object.
cameron314 added reviewers: clayborg, zturner, jingham.
cameron314 set the repository for this revision to rL LLVM.
cameron314 added subscribers: mamai, lldb-commits.
clayborg requested changes to this revision.Apr 14 2016, 11:01 AM
clayborg edited edge metadata.

If you use the instance variable directly, we will need a mutex inside the HostNativeThreadBase class to protect against multi-threaded access. I believe the reason it was being copied before was in case the Reset was called on another thread. So we might want a mutex in HostNativeThreadBase that gets exposed via HostThread. See inlined comments.

source/Target/Process.cpp
4116 ↗(On Diff #53729)

If you are going to do IsJoinable(), Cancel(), and Join() then this is still racy in that someone else could do the same thing on another thread. So this code should probably take a mutex to protect:

Mutex::Locker locker(m_private_state_thread.GetMutex());
if (m_private_state_thread.IsJoinable())

This would need to live in HostNativeThreadBase because HostThread contains a shared pointer to a HostNativeThreadBase and HostThread can be copied.

This revision now requires changes to proceed.Apr 14 2016, 11:01 AM
cameron314 added inline comments.Apr 14 2016, 2:07 PM
source/Target/Process.cpp
4116 ↗(On Diff #53729)

I removed the call to Reset from the other thread (line 4452 below). It didn't make sense for a thread to reset its own HostThread since it's not its own owner, the Process is.

So now only the Process is allowed to control the HostThread it owns, and there's no races and thus no need for a mutex.

Note also that using a copy of the instance variable is no different from using the instance variable directly, here, since as you say they both have a shared pointer to the same underlying object which is manipulated through either of the two to the same effect.

HostThread::Reset() is a bad function and should not be directly used by anyone other that Join() or Detach(). Just clearing the thread ID and setting the result to zero is bad and means we will leak a thread if no one else joins it. If someone else already called Detach() on the thread, then Reset() is ok. Also calling m_private_state_thread.Cancel() is equally bad.

The real question is why is the cancel timing out?

I agree that the call to "m_private_state_thread.Reset();" should be removed from Process::RunPrivateStateThread(). But we should look at why we need to call Cancel in the first place. We should never need to cancel our own thread unless something really bad has gone on. So it seems like the real bug here is bad thread synchronization leading to issues.

I don't like the fact that we are not copying the thread anymore. We have code that sometimes needs to spin up an extra private state thread and I wouldn't want the code that changes the m_private_state_thread after backing up an older one, cause this function to call IsJoinable(), Cancel() and Join() on different objects. So it seems like the only needed fix here is to not call "m_private_state_thread.Reset();" in Process::RunPrivateStateThread(), all other changes can be removed?

Hmm, interesting. Yes, it should not be timing out in the first place. That's likely a bug on our side, still to be determined. I'm looking into it. Regardless, it led us to find and fix this race :-)

I think it still makes sense for LLDB to do a Cancel() as a last-ditch effort on timing out. But Cancel() is fundamentally very dangerous, so then again, maybe not.

Can the code that spins up the extra thread (RunThreadPlan) run on a different thread than ControlPrivateStateThread runs on? I was under the (perhaps entirely false!) impression that the Process object was used by only one thread at a time (with the exception of the private state thread(s), of course, but that thread shouldn't touch its own control variables).

If that's not the case, then looking at the RunThreadPlan code, there's lots more races on m_private_state_thread alone (e.g. calling EqualsThread on it while it could be assigned from another thread). And even just creating a copy is not thread safe, for example, since it could be reassigned from another thread at the same time... you'd need m_private_state_thread to be wrap its shared pointer with std::atomic.

from the C++ docs:

All member functions (including copy constructor and copy assignment) can be called by multiple threads on different instances of shared_ptr without additional synchronization even if these instances are copies and share ownership of the same object. If multiple threads of execution access the same shared_ptr without synchronization and any of those accesses uses a non-const member function of shared_ptr then a data race will occur; the shared_ptr overloads of atomic functions can be used to prevent the data race.

So we might not need to do anything.

So back to my original comment: can we just remove all changes except the "m_private_state_thread.Reset();" in Process::RunPrivateStateThread()?

We should figure out what is going wrong on your system though and figure out why we need to call Cancel() as well.

I read the same docs :D This is the important part:

If multiple threads of execution access the same shared_ptr without synchronization and any of those accesses uses a non-const member function of shared_ptr then a data race will occur; the shared_ptr overloads of atomic functions can be used to prevent the data race.

Copying the shared pointer and accessing its pointee through it is thread safe (copying a wrapper object technically isn't, but it will boil down to the same thing, so let's leave that aside). But copying it while the pointer is being reassigned (operator=) on a different thread is not thread safe.

To answer your question, yes, absolutely, we can just remove that one call to Reset. But if the copy is still necessary, that means a lot of existing code (including that copy) is also racy and should be reviewed (though this would make sense to do in a separate patch).

jingham edited edge metadata.Apr 14 2016, 3:58 PM
jingham added a subscriber: jingham.

At the Command & SB API level, we restrict access to the process from multiple threads with the run lock. But internally it is currently more of a gentleman's agreement not to do this.

I don't think it would ever be useful to try to make calling functions in the target (which is the job of RunThreadPlan) work concurrently. Rather we should have some higher level permission baton that you have to have to do anything that would run the target, and then let only one thread at a time do this work. That would make it impossible for RunThreadPlan to get run on multiple threads for the same process. So while it wouldn't hurt to protect the m_private_state_thread variable with locks, that's not really the right level to do this, and if you ever let two threads try to run function calls in the target simultaneously, racy-ness in the m_private_state_thread would be the least of your problems...

Jim

OK, I'll pare down the patch to just that line then.

I found the reason for the time out. At the end of our debug session, we destroy the Target, which destroys the process. Process::Destroy in turn calls Process::Detach, which calls DoDetach just before calling StopPrivateStateThread (which times out). Our platform's process object (modeled after the GDB remote one) sets the thread state to eStateDetached in its DoDetach implementation; and the private state thread exits as soon as it sees that state, meaning it's no longer around handle the eBroadcastInternalStateControlStop state event from the StopPrivateStateThread just after.

I think the real bug is in StopPrivateStateThread, which only sends the stop state if the thread is still joinable; since I removed the Reset, it is of course still joinable after it exits. But even with the Reset (which we agree shouldn't be there), there is still a race between the time the detach signal is sent and when the thread sees it and exits. It seems once a detach or stop is sent, no further detaches/stops can be sent, regardless of what the thread is doing (since it will exit as soon as it sees the first one). So there should be a flag to that effect somewhere checked in ControlPrivateStateThread. Probably all the calls to IsJoinable() should be replaced with calls to PrivateStateThreadIsValid, which should be updated to check that flag.

I don't see a nice way to add this flag for each private state thread, though, instead of for the process object as a whole. But maybe that's all that's necessary?

I think the real bug is in StopPrivateStateThread, which only sends the stop state if the thread is still joinable; since I removed the Reset, it is of course still joinable after it exits. But even with the Reset (which we agree shouldn't be there), there is still a race between the time the detach signal is sent and when the thread sees it and exits. It seems once a detach or stop is sent, no further detaches/stops can be sent, regardless of what the thread is doing (since it will exit as soon as it sees the first one). So there should be a flag to that effect somewhere checked in ControlPrivateStateThread. Probably all the calls to IsJoinable() should be replaced with calls to PrivateStateThreadIsValid, which should be updated to check that flag.

I don't see a nice way to add this flag for each private state thread, though, instead of for the process object as a whole. But maybe that's all that's necessary?

You can just check the value of m_private_state_control_wait:

void
Process::ControlPrivateStateThread (uint32_t signal)
{
    // Signal the private state thread. First we should copy this is case the
    // thread starts exiting since the private state thread will NULL this out
    // when it exits
    HostThread private_state_thread(m_private_state_thread);
    if (private_state_thread.IsJoinable() && m_private_state_control_wait.GetValue() == false)
    {

m_private_state_control_wait is always false if the private state thread is running. It is set to true immediately after sending a control and syncing with the sender and then it is set back to false. It is set to true again when the thread exits.

Ooh, that might work. But when ControlProvateStateThread resets m_private_state_control_wait to false there's still a race between that and the thread exiting. It could then be set back to false even after the thread has exited (this is even likely for a detach).

I tried a different approach, changing PrivateStateThreadIsValid to the following. I'm not sure it's the right thing to do, but it certainly fixes the race (and thus timeout) on our platform:

bool
PrivateStateThreadIsValid () const
{
    lldb::StateType state = m_private_state.GetValue();
    return state != lldb::eStateDetached &&
           state != lldb::eStateExited &&
           m_private_state_thread.IsJoinable();
}
cameron314 updated this object.
cameron314 edited edge metadata.
cameron314 removed rL LLVM as the repository for this revision.

I've updated the diff to include a fix for the race between Detach and Stop. This has been working nicely for us without any problems for the last few days.

I looked again at putting back the wrapper for the m_private_state_thread, but it really doesn't make sense to me. Anything that can go wrong without the wrapper can still happen with the wrapper.

cameron314 edited edge metadata.

Oops, uploaded wrong diff a minute ago. Here's the one with the full fix.

clayborg accepted this revision.Apr 18 2016, 1:20 PM
clayborg edited edge metadata.

Looks good.

This revision is now accepted and ready to land.Apr 18 2016, 1:20 PM
This revision was automatically updated to reflect the committed changes.
labath added a subscriber: labath.Apr 19 2016, 7:12 AM

Hi,

I have reverted this as it was causing a number of failures on the linux build bot http://lab.llvm.org:8011/builders/lldb-x86_64-ubuntu-14.04-cmake/builds/13585. I think we'll need to investigate them before putting this in. Do you have the ability to run the linux test suite? If not, I can send you some logs and core files from the build machine...

Hmm, that's not good.
No, I don't have a Linux machine set up. I can set up a VM but if you already have the dumps/logs it would be faster to start with those.

Thanks!