This is an archive of the discontinued LLVM Phabricator instance.

[lldb, process] Fix occasional hang when launching a process in LLDB
ClosedPublic

Authored by stella.stamenova on May 31 2018, 3:45 PM.

Details

Summary

Occasionally, when launching a process in lldb (especially on windows, but not limited to), lldb will hang before the process is launched and it will never recover. This happens because the timing of the processing of the state changes can be slightly different. The state changes that are issued are:

  1. SetPublicState(eStateLaunching)
  2. SetPrivateState(eStateLaunching)
  3. SetPublicState(eStateStopped)
  4. SetPrivateState(eStateStopped)

What we expect to see is:
public state: launching -> launching -> stopped
private state: launching -> stopped

What we see is:
public state: launching -> stopped -> launching
private state: launching -> stopped

The second launching change to the public state is issued when WaitForProcessStopPrivate calls HandlePrivateEvent on the event which was created when the private state was set to launching. HandlePrivateEvent has logic to determine whether to broadcase the event and a launching event is *always* broadcast. At the same time, when the stopped event is processed by WaitForProcessStopPrivate next, the function exists and that event is never broadcast, so the public state remains as launching.

HandlePrivateEvent does two things: determine whether there's a next action as well as determine whether to broadcast the event that was processed. There's only ever a next action set if we are trying to attach to a process, but WaitForProcessStopPrivate is only ever called when we are launching a process or connecting remotely, so the first part of HandlePrivateEvent (handling the next action) is irrelevant for WaitForProcessStopPrivate. As far as broadcasting the event is concerned, since we are handling state changes that already occurred to the public state (and are now duplicated in the private state), I believe the broadcast step is unnecessary also (and in fact, it causes the hang).

This change removes the call to HandlePrivateEvent from inside WaitForProcessStopPrivate.

Incidentally, there was also a bug filed recently that is the same issue: https://bugs.llvm.org/show_bug.cgi?id=37496

Diff Detail

Repository
rL LLVM

Event Timeline

Launch and attaching are weird because they happen before you have a private state thread (or if there is one it is explicitly paused), so the events don't flow as they normally would. So neither launch nor attach are relying on a private eStateLaunching event.

Looking at the code, only FreeBSD and Windows call SetPrivateState(eStateLaunching). When I do "run" on macOS I see:

(lldb) log enable lldb state
(lldb) run

Process::SetPublicState (state = launching, restarted = 0)
Process::SetPrivateState (stopped)
Process::SetPrivateState (stopped) stop_id = 1
Process::SetPublicState (state = stopped, restarted = 0)
Process::SetPublicState (stopped) -- unlocking run lock
Process::SetPrivateState (running)
Process::SetPublicState (state = running, restarted = 0)

The private eStateLaunching event shouldn't be needed on Windows either, and I think that it is just causing trouble. What happens if you just remove the SetPrivateState(eStateLaunching) call? If that fixes the issue, I'd rather do it that way, since that makes all the platforms regular.

Then we need to find some convenient place to document these expectations...

BTW I'm not disagreeing with your analysis of the current uses of WaitForProcessStopPrivate. However, if you change it the way that you are suggesting, it will no longer be a generally useable function, it could only be used in in the narrow cases you describe. So if we have to change it as you are suggesting, we should change its name to something more descriptive of its reduced functionality. But again, if removing the SetPrivateState(eStateLaunching) fixes the problem, I'd rather do it that way.

As an aside, the comment on line 2691 seems to have wandered out of place - presumably when this code got refactored? It doesn't make sense here...

labath added a comment.Jun 1 2018, 1:18 AM

Launch and attaching are weird because they happen before you have a private state thread (or if there is one it is explicitly paused), so the events don't flow as they normally would. So neither launch nor attach are relying on a private eStateLaunching event.

Looking at the code, only FreeBSD and Windows call SetPrivateState(eStateLaunching).

I don't know if this matters, but windows and freebsd are also the only platforms using an in-process debugging plugin. (But I agree that unifying the event sequences across different platforms is the best approach.)

Removing the SetPrivateState(eStateLaunching) from Windows does resolve the issue of the hang. I considered that as well and it seemed that the more generic solution would be better since it would fix the problem for other platform such as FreeBSD as well. I am going to run the full suite of tests on Windows without the private state change and then update the review (I'll also fix the comment that has wandered out of place) but I am not going to touch the FreeBSD code path.

Rather than going with a generic solution, remove the change to private state to launching. This change also updates a comment that had ended up in the wrong place.

zturner accepted this revision.Jun 1 2018, 10:15 AM

I trust that this doesn't regress the test suite on Windows. You guys are more active in the Windows stuff for the time being, so if this works for you and doesn't break anything, I'm good with it.

This revision is now accepted and ready to land.Jun 1 2018, 10:15 AM
jingham accepted this revision.Jun 1 2018, 10:54 AM

Can we add a comment to lldb-enumerations.h after eStateLaunching and eStateAttaching are defined saying that these state changes are all sent while the private state thread is either not yet started or paused, so they should only be signaled as Public state changes, and not Private state changes? I think that's the clearest place we can document this behavior.

Other than that, this change is fine.

Thanks for digging into this, BTW, this sort of thing isn't easy to track down!

I trust that this doesn't regress the test suite on Windows. You guys are more active in the Windows stuff for the time being, so if this works for you and doesn't break anything, I'm good with it.

Without this change the test suite on Windows will not always finish as some tests will hit the bug and then hang...

This revision was automatically updated to reflect the committed changes.