This is an archive of the discontinued LLVM Phabricator instance.

[lldb] Fix ProcessEventData::DoOnRemoval to support multiple listeners
Needs Revision · Public

Authored by ilya-nozhkin on Aug 26 2020, 12:24 PM.

Details

Summary

Process updates its state and threads when an event is removed from
some public listener's event queue. The comment at the beginning of
DoOnRemoval and the condition checking m_update_state indicate that all
these actions on the process state and threads must be executed only
once, and only after the private listener has already handled the
event. However, m_update_state is never modified except by the
invocation of SetUpdateStateOnRemoval in HandlePrivateEvent. This means
that m_update_state becomes 1 after an event is handled by the private
listener and is never changed again. So, if more than one listener
listens for process events (for example, if someone adds a listener
following the example [1]), the code in DoOnRemoval may be executed
more than once. Moreover, since events are handled asynchronously,
DoOnRemoval may be executed by two threads simultaneously, which leads
to a data race.

This commit replaces the m_update_state-based mechanism with a check of
the event broadcaster at the beginning of DoOnRemoval and a boolean
flag recording whether DoOnRemoval's logic has already been executed.
It also adds a mutex that is held for the whole execution of
DoOnRemoval. This makes DoOnRemoval thread-safe and ensures that the
public state of the process is correct for all threads invoking
DoOnRemoval.

[1] https://lldb.llvm.org/python_reference/lldb.SBEvent-class.html
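
For illustration only, here is a rough Python sketch of the multiple-listener situation described above, following the pattern of the linked SBEvent example. The function and listener names are made up; done_event is assumed to be a threading.Event controlled by the caller.

```python
import lldb

def secondary_listener_loop(process, done_event):
    """Hypothetical second listener attached alongside the debugger's default one."""
    listener = lldb.SBListener("my-secondary-listener")
    process.GetBroadcaster().AddListener(
        listener, lldb.SBProcess.eBroadcastBitStateChanged)
    event = lldb.SBEvent()
    while not done_event.is_set():
        if listener.WaitForEvent(1, event):
            state = lldb.SBProcess.GetStateFromEvent(event)
            # Without this patch, dequeuing the event here can run
            # ProcessEventData::DoOnRemoval a second time, racing with the
            # primary listener's dequeue of the same stop.
            print("secondary listener saw state:",
                  lldb.SBDebugger.StateAsCString(state))
```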

Diff Detail

Event Timeline

ilya-nozhkin created this revision. Aug 26 2020, 12:24 PM
ilya-nozhkin requested review of this revision. Aug 26 2020, 12:24 PM
jingham requested changes to this revision. Aug 26 2020, 2:17 PM

This change is fine for what it does, but I don't think the model that it allows is really supportable. If you have multiple listeners on the process events, and Listener A wants to do a "step" when notified that the process has stopped, but Listener B wants to do a "Continue", then there's no way to coordinate these and we're just allowing race conditions in controlling the process. I think you really need to have one agent that controls the process, and then other threads that are passive listeners.

My intention for this was to have the primary Listener (the one that was registered with the process when it was created) be the "controlling listener". We would ensure that that listener gets notified first, and add some kind of handshake where the controlling listener relinquishes control, after which the other listeners will be informed.

Anyway, my feeling is we should think a bit more about policy w.r.t. fetching process events from multiple Process listeners before we allow them.

There should also be some test cases for handling a process with multiple listeners, since that's not something we've supported before.

This revision now requires changes to proceed. Aug 26 2020, 2:17 PM

Yeah, I understand the problem of two listeners trying to change the process state simultaneously, and I agree that informing the other listeners once the primary one has finished its job would be a great solution. But the problem is that the primary listener just provides an event to some user code (via GetEvent), and there is no way to find out when that code has finished without requiring it to invoke a handshake explicitly.

One possible solution is to send a specially marked event to the primary listener first and make this event invoke a handshake in its destructor (which would mean the primary listener no longer needs the event). The handshake could be implemented as simply rebroadcasting a new instance of the event to the other listeners (or by unlocking an additional mutex / notifying a condition variable). However, the destructor may not be invoked immediately after event processing is finished; for example, if the primary listener's thread defines an EventSP before the listening loop and doesn't reset it until GetEvent returns a new one (which may not happen until the next stop).

Anyway, the concept of primary and secondary listeners requires forbidding secondary listeners from doing "step" or "continue" at all. So, why can't we do the same without any additional code? I think the most common case of multiple listeners is a Python extension adding a listener to read/write some memory/registers on each stop while holding the RunLock. Such an extension does not interfere with a primary listener except for the simultaneous DoOnRemoval invocation (which is fixed by this patch). Moreover, as I understand it, there is no way to implement such an extension without adding a new listener (because if one tried to use the Debugger's default listener, all its other users, such as lldb-vscode, could no longer handle process events).
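
To make that scenario concrete, here is a minimal, purely illustrative sketch of such a passive extension (the helper name is hypothetical). It only reads state on a stop and leaves all resume/step decisions to the primary listener:

```python
import lldb

def on_stop_snapshot(process):
    """Read-only work performed when a secondary listener sees a stop event."""
    for thread in process:
        frame = thread.GetFrameAtIndex(0)
        print("thread %u stopped at 0x%x" % (thread.GetIndexID(), frame.GetPC()))
        error = lldb.SBError()
        data = process.ReadMemory(frame.GetSP(), 16, error)
        if error.Success():
            # Inspect a few bytes from the top of the stack; never resume here.
            print("  stack bytes:", bytearray(data).hex())
```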

This still seems to me a pretty fragile way to program lldb. For instance, if the controlling thread gets the stop event first and decides it wants to continue, it will set the process running again, and by the time the memory-reading thread gets the event, the process will have already proceeded and the memory read will fail. But if it happens the other way around, it will succeed. That will make for subtle bugs in your implementation.

Having a bunch of event listeners hanging out doing ancillary stop tasks also makes it hard to do cancellation. For instance, if the user presses "step" again while you are still doing the work to present the current stop, you want to discard all the work you were doing for the first stop and just step again. People like to drive from an interesting place to another interesting place by whaling on the "next" button and want that to be very responsive and don't care if all the data isn't updated in getting there. As long as the UI knows what work it cancelled and doesn't mark data as updated if it wasn't, this does seem to better model how people told us they want GUI debuggers to behave.

You could also implement this sort of thing with stop hooks that get run before any of the other actions are taken by the controlling thread (or by implementing this yourself). There isn't currently a way to write stop hooks in Python but it would be simple to add that. The stop hooks don't solve the cancellation problem, however.
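
For reference, stop hooks can already be defined from the lldb command line; a minimal example (the one-liner command is just an illustration):

```
(lldb) target stop-hook add --one-liner "register read pc sp"
```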

Anyway, "I wouldn't do it that way" isn't a good reason not to make a thing possible. If you feel like you can have success with this added ability, I can't see how it would do harm.

You still need to add some tests for this new behavior. We have been telling people not to do it this way pretty much forever, and so there are no tests for multiple process listeners. It would be good to write a not entirely trivial test that tries to do a little simultaneous work in your different Listeners, and that deals with DoOnRemoval restarting the target, as well as one of the Listeners doing so, to ensure we don't fall over for other reasons when we actually try to use the feature.

This is going to derail this patch somewhat (sorry), but since this is an important long-standing open issue, I think we should discuss it in more detail.

The part that's bothering me about this for quite some time is... Why do we even have this DoOnRemoval stuff in the first place?

I mean instead of doing this work in DoOnRemoval, why don't we do this work /before/ enqueueing the relevant event? Then the event would be of purely informative character and it would not matter if it's received by one, two, ten, or zero listeners. Of course, if a user wanted to program lldb in a correct and race-free manner, they would /have to/ listen to these events (as otherwise they would not know when it is safe to run some actions), but that would be entirely up to them.

Back when I was debugging some races in tests with asynchronous listeners, I remember it was pretty hard to reason about things and prove they are correct, since the dequeueing of the event (and the resulting flurry of actions within lldb) could come at any moment. If those actions were done internally to lldb, they would be sequenced more deterministically and be easier to reason about than if they are executed whenever the user chooses to dequeue an event. (I remember things got particularly messy when a test finished before dequeuing an event, at which point the processing of the event was executing concurrently with the process.Kill() statement we execute at the end of each test.)

One possible solution is to send a specially marked event to the primary listener first and make this event invoke a handshake in its destructor (which would mean the primary listener no longer needs the event). The handshake could be implemented as simply rebroadcasting a new instance of the event to the other listeners (or by unlocking an additional mutex / notifying a condition variable). However, the destructor may not be invoked immediately after event processing is finished; for example, if the primary listener's thread defines an EventSP before the listening loop and doesn't reset it until GetEvent returns a new one (which may not happen until the next stop).

This primary listener technique could be implemented purely in your own code, could it not? You could have one listener which listens to lldb events and then rebroadcasts them (using arbitrary techniques, not necessarily SBListener) to any number of interested threads that want to act on them. That might make it easier to ensure there are no races in your code, and would avoid stepping on some of the internal lldb races that I am sure we have... Otherwise I (like Jim) expect that you will still need some sort of communication between the various listeners listening to process events, as they'd need to coordinate.
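
A minimal sketch of that rebroadcasting idea in plain Python, assuming the standard lldb listener calls and hypothetical names (done_event is a threading.Event): one primary loop dequeues process events and makes the control decisions, then forwards the state to passive workers over queue.Queue objects.

```python
import queue
import lldb

def primary_listener_loop(listener, worker_queues, done_event):
    """The only place where process events are dequeued and control decisions made."""
    event = lldb.SBEvent()
    while not done_event.is_set():
        if not listener.WaitForEvent(1, event):
            continue
        state = lldb.SBProcess.GetStateFromEvent(event)
        # Decide about step/continue here, in one place...
        # ...then hand the purely informational state to the passive workers.
        for q in worker_queues:
            q.put(state)

def passive_worker(q, done_event):
    """Worker that only observes; it never steps or resumes the process."""
    while not done_event.is_set():
        try:
            state = q.get(timeout=1)
        except queue.Empty:
            continue
        if state == lldb.eStateStopped:
            pass  # read memory/registers, update UI, etc.
```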

Anyway, the concept of primary and secondary listeners requires forbidding secondary listeners from doing "step" or "continue" at all. So, why can't we do the same without any additional code? I think the most common case of multiple listeners is a Python extension adding a listener to read/write some memory/registers on each stop while holding the RunLock.

Yes, you could take the run lock, which would guarantee that the other listeners don't get to resume the process. But how do you guarantee that you _can_ take the run lock -- some other thread (the primary listener) could beat you to it and resume the process, so you'd miss the chance to read/write those registers. Seems very fragile. A stop hook does look like a better way to achieve these things.

jingham added a comment. Edited Aug 28 2020, 9:55 AM

This is going to derail this patch somewhat (sorry), but since this is an important long-standing open issue, I think we should discuss it in more detail.

The part that's bothering me about this for quite some time is... Why do we even have this DoOnRemoval stuff in the first place?

I mean instead of doing this work in DoOnRemoval, why don't we do this work /before/ enqueueing the relevant event? Then the event would be of purely informative character and it would not matter if it's received by one, two, ten, or zero listeners. Of course, if a user wanted to program lldb in a correct and race-free manner, they would /have to/ listen to these events (as otherwise they would not know when it is safe to run some actions), but that would be entirely up to them.

Back when I was debugging some races in tests with asynchronous listeners, I remember it was pretty hard to reason about things and prove they are correct, since the dequeueing of the event (and the resulting flurry of actions within lldb) could come at any moment. If those actions were done internally to lldb, they would be sequenced more deterministically and be easier to reason about than if they are executed whenever the user chooses to dequeue an event. (I remember things got particularly messy when a test finished before dequeuing an event, at which point the processing of the event was executing concurrently with the process.Kill() statement we execute at the end of each test.)

It would certainly be worth rethinking this; these were decisions made very early in the development of lldb...

I did it the way I did for several reasons:

The first was to get the work done by DoOnRemoval - which is basically anything lldb can do, since it's where the breakpoint callbacks are triggered - off the private state thread. The private state thread needs to be alive and getting private events and running the thread plans to do the work required by breakpoint callbacks, and while I do have a system to spin up another private state thread if you need to use one while doing some piece of work on the private state thread, that seemed fragile to me and I didn't want to lean on it too heavily. It is important to make explicit the transition from lldb doing internal work with the incoming event stream to giving over control to the user. Internally, we have to let ourselves do things based off the private state, but external users are gated by the public state instead. Having it happen when the event gets handed to the user was a convenient place to mark this transition.

The other reason I did it this way was that I didn't want the process state and the event stream to get out of sync. If we do all the work on an event, including setting the public state to stopped, before posting the event, then you can easily get into the situation where the public state says "stopped" when you haven't yet fetched the stopped event off of the event queue. This inconsistency seemed likely to result in confusing errors in use of the APIs.

If we can come up with another way to achieve these goals but do the user-facing work of processing the event somewhere else, I don't have any objections.

One possible solution is to send a specially marked event to the primary listener first and make this event invoke a handshake in its destructor (which would mean the primary listener no longer needs the event). The handshake could be implemented as simply rebroadcasting a new instance of the event to the other listeners (or by unlocking an additional mutex / notifying a condition variable). However, the destructor may not be invoked immediately after event processing is finished; for example, if the primary listener's thread defines an EventSP before the listening loop and doesn't reset it until GetEvent returns a new one (which may not happen until the next stop).

This primary listener technique could be implemented purely in your own code, could it not? You could have one listener which listens to lldb events and then rebroadcasts them (using arbitrary techniques, not necessarily SBListener) to any number of interested threads that want to act on them. That might make it easier to ensure there are no races in your code, and would avoid stepping on some of the internal lldb races that I am sure we have... Otherwise I (like Jim) expect that you will still need some sort of communication between the various listeners listening to process events, as they'd need to coordinate.

Anyway, the concept of primary and secondary listeners requires forbidding secondary listeners from doing "step" or "continue" at all. So, why can't we do the same without any additional code? I think the most common case of multiple listeners is a Python extension adding a listener to read/write some memory/registers on each stop while holding the RunLock.

Yes, you could take the run lock, which would guarantee that the other listeners don't get to resume the process. But how do you guarantee that you _can_ take the run lock -- some other thread (the primary listener) could beat you to it and resume the process, so you'd miss the chance to read/write those registers. Seems very fragile. A stop hook does look like a better way to achieve these things.

I've realized that I just didn't know about stop hooks... So, sorry and thanks, that solves almost all my problems except for notifying about "resume" events, which are important for me too. But I can use synchronous Process::Notifications to receive those (I couldn't use Notifications to receive stop events because they are sent before the public state is changed by DoOnRemoval). But then I thought: why don't we just move the SynchronouslyNotifyStateChanged invocation from ShouldBroadcastEvent to DoOnRemoval? It would provide the most convenient way of receiving process events without interfering with the primary listener. I can make these changes in another patch instead of this one.

This primary listener technique could be implemented purely in your own code, could it not?

I am trying to write an extension that can work with things like lldb-vscode and lldb-mi without disturbing them. They use the Debugger's default listener. I can't change the code that drives that listener on my side, so I can't force them to notify my listener.