This is an archive of the discontinued LLVM Phabricator instance.

[lldb] Fixed race condition on private state thread exit, take 2
ClosedPublic

Authored by cameron314 on Jun 13 2016, 9:24 AM.

Details

Summary

This is the follow-up to D19122, which was accepted but subsequently reverted due to a bug it introduced (one I didn't see during local testing on Windows but which manifested quite often on Linux). That bug (a race between the Process object being destroyed and the thread terminating, caused by the join not being done under certain conditions) is fixed in this version of the patch.

This patch fixes various races between the time the private state thread is signaled to exit and the time it actually exits (during which it no longer responds to events). Previously, this was consistently causing 2-second timeout delays on process detach/stop for us.

This also prevents crashes that were caused by the private state thread manipulating its own owning pointer while the controlling thread was using it (copying the thread wrapper is not enough to mitigate this, since the internal thread object was getting reset anyway). Again, we were seeing this consistently.
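To make the race concrete, here is a generic sketch of the pattern being described (illustrative names only, with std::thread standing in for HostThread and m_worker for m_private_state_thread; this is not the actual LLDB code):

#include <thread>

struct Controller {
  std::thread m_worker; // stands in for m_private_state_thread

  void Start() {
    m_worker = std::thread([this] {
      // ... private event loop runs here ...
      // BAD (the pre-patch hazard): the worker resets the handle that owns
      // it on its way out, racing with a controller that may be using the
      // same member at that moment:
      //   m_worker = std::thread{};
    });
  }

  void Stop() {
    // Copying the handle into a local does not remove the race while the
    // member can still be reset concurrently; instead, only the controller
    // joins and then clears the handle.
    if (m_worker.joinable())
      m_worker.join();
    m_worker = std::thread{};
  }
};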

For what it's worth, I've run the test suite with this change (on Linux) without any regressions, and the number of reruns dropped from 15 to 0 for me (though that last part may be coincidence).

Diff Detail

Repository
rL LLVM

Event Timeline

cameron314 updated this revision to Diff 60536. Jun 13 2016, 9:24 AM
cameron314 retitled this revision from to [lldb] Fixed race condition on private state thread exit, take 2.
cameron314 updated this object.
cameron314 added reviewers: clayborg, labath, zturner.
cameron314 added a subscriber: lldb-commits.
clayborg requested changes to this revision. Jun 13 2016, 1:40 PM
clayborg edited edge metadata.

This looks like it would work for normal operation. I am not sure it will work when an extra private state thread is spun up. In certain circumstances we need to create more private state threads. This happens when you have an expression that is run from the private state thread. Since the private state thread is what controls the actual process, we can't run an expression from the current private state thread, so we spin up new ones. The current code is doing some tricky things to deal with this, and that was part of the reason Process::ControlPrivateStateThread() was making a copy of the current value of "m_private_state_thread" into a local variable named "private_state_thread":

HostThread private_state_thread(m_private_state_thread);

See the code in "Process::RunThreadPlan()" around the code:

if (m_private_state_thread.EqualsThread(Host::GetCurrentThread()))

The new loop as written in Process::ControlPrivateStateThread() could end up seeing different contents of m_private_state_thread by the time it reaches the "if (m_private_state_thread.IsJoinable())" statement. Jim Ingham made these changes, so we should add him to the reviewer list. I am going to mark this as "Request Changes" so we can address any such issues, but Jim should chime in on this before we proceed.

The way this normally happens is: an expression is being run, and while handling it on the private state thread we need to run another expression (like a call to "mmap" in the debugged process so we can allocate memory), so we spin up another private state thread to handle the needed starts and stops. Only one of these threads will be actively grabbing events at any given time, so your patch might just work, but I want to get some more eyes on this to be sure.
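Roughly, the backup/restore dance described above looks like this (a simplified sketch assuming the lldb_private::Process context; member and helper names approximate the LLDB code, and the real Process::RunThreadPlan does considerably more):

// Simplified sketch, not the literal LLDB code.
void Process::RunThreadPlan(/* ... */) {
  HostThread backup_private_state_thread;

  if (m_private_state_thread.EqualsThread(Host::GetCurrentThread())) {
    // We are already on the private state thread, and it cannot both drive
    // the process and wait for this expression to finish, so stash it and
    // spin up a temporary private state thread to handle the starts/stops.
    backup_private_state_thread = m_private_state_thread;
    StartPrivateStateThread(true /* is_secondary_thread */);
  }

  // ... run the thread plan; events are consumed by whichever thread is
  //     currently installed in m_private_state_thread ...

  if (backup_private_state_thread.IsJoinable()) {
    // Stop the temporary thread first, then restore the original one.
    StopPrivateStateThread();
    m_private_state_thread = backup_private_state_thread;
  }
}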

Please add Jim Ingham as a reviewer.

This revision now requires changes to proceed. Jun 13 2016, 1:40 PM

@clayborg: Thanks for having a look! I've added Jim Ingham as a reviewer. @jingham, I'd appreciate it if you could take a few minutes to look this over.

Right, I'd seen the backup/restore of the thread. As far as I can tell it should still work; the code in ControlPrivateStateThread has no idea it's working with a temporary thread, just as it didn't know before (in fact, if you look carefully at the code in the present tip of the trunk, a recent change seems to have introduced a mix of using both private_state_thread and m_private_state_thread, probably by accident). m_private_state_thread cannot be reset to the backup during a control event, since the first thing that's done before restoring the backup thread is to stop the temporary thread.

OK, as long as Jim agrees, I will give it the go-ahead.

jingham accepted this revision. Jun 13 2016, 3:27 PM
jingham edited edge metadata.

This looks okay to me.

clayborg accepted this revision. Jun 13 2016, 3:37 PM
clayborg edited edge metadata.

Looks good if Jim is happy.

This revision is now accepted and ready to land. Jun 13 2016, 3:37 PM
labath accepted this revision. Jun 14 2016, 12:00 AM
labath edited edge metadata.

Seems to run fine on Linux now. Thanks for investigating this. We'll monitor the buildbots and let you know if anything bad happens. ;)

BTW, your comment in ControlPrivateStateThread seems to indicate that the Linux behavior is inconsistent/unexpected in some way. Do you think it would be worth filing a bug about that?

Thanks everyone :-)

Ah, yeah, sorry if I gave the wrong impression, but that comment is not specific to Linux (in fact, I've only seen the issue once, on Windows). At one point the debugger had entered ControlPrivateStateThread on one thread to stop it, seen that the thread was already in an invalid state (it was), and assumed that meant the thread was already exiting, so it did the join without sending the stop event. But the state thread somehow wasn't on its way out yet; it was still stuck waiting for a control event (this is the part that I'm not sure should be possible, but empirically it is). This caused a deadlock. So I changed my patch to always send the event if the thread is joinable, not just if its state is valid, and left that comment to explain why this must remain so.
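For reference, a condensed sketch of the resulting stop path (simplified; SendControlEvent is a hypothetical placeholder for broadcasting the control event, and the real Process::ControlPrivateStateThread also waits for an acknowledgement and handles the other control signals):

// Condensed sketch of the stop/join branch only; not the literal code.
void Process::ControlPrivateStateThread(uint32_t signal) {
  if (m_private_state_thread.IsJoinable()) {
    // Always post the control event to a joinable thread -- even if its
    // thread state already looks invalid -- so a thread that is still
    // parked waiting for a control event can wake up and exit. Joining
    // without sending the event first is what caused the deadlock above.
    SendControlEvent(signal); // hypothetical helper
    m_private_state_thread.Join(nullptr);
    m_private_state_thread.Reset();
  }
}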

This revision was automatically updated to reflect the committed changes.
labath added a subscriber: labath. Jun 14 2016, 10:52 AM

OK, I see. Thanks for the explanation. This may actually be some Windows-specific thing then, as I remember Zachary mentioning they have some flakiness issues there.

BTW, this has sped up the LLDB test suite nearly 2x, so thanks a lot for that. :)

pl