This is an archive of the discontinued LLVM Phabricator instance.

[WIP][lldb/Reproducers] Synchronize the command interpreter with asynchronous events
Abandoned · Public

Authored by JDevlieghere on Jul 8 2020, 6:25 PM.

Details

Summary

Add synchronization for asynchronous events. This fixes an unexpected-packet error during replay that occurs when an asynchronous event triggers a GDB packet. Consider the following example:

$ ./bin/lldb --capture
(lldb) gdb-remote 13000
Process 770 stopped
* thread #1, stop reason = signal SIGSTOP
    frame #0: 0x0000000105071000 dyld`_dyld_start
(lldb) cont
Process 770 resuming
Process 770 exited with status = 0 (0x00000000)
(lldb) reproducer generate

$ ./bin/lldb --replay /path/to/reproducer
(lldb) gdb-remote 13000
(lldb) cont
Reproducer expected packet: '$jThreadExtendedInfo:{"thread":14341}#01'
Reproducer received packet: 'c'
LLVM ERROR: Encountered unexpected packet during replay

The thread state displayed at the prompt is printed asynchronously: it is a reaction to an event, in this case a process state change on connection.
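
To make the race concrete, here is a minimal sketch (hypothetical stand-in types, not LLDB's actual classes): the stop-reason printing and the next command run on different threads, so the order in which their packets hit the connection depends on scheduling.

#include <cstdio>
#include <string>
#include <thread>

// Hypothetical stand-in for the GDB-remote connection. During capture, the
// reproducer records packets in whatever order they reach the wire.
struct Connection {
  void SendPacket(const std::string &packet) {
    std::printf("-> %s\n", packet.c_str());
  }
};

int main() {
  Connection conn;

  // Event handler thread: reacts to the stopped event and fetches extra
  // thread info in order to print the stop reason at the prompt.
  std::thread event_handler(
      [&conn] { conn.SendPacket("jThreadExtendedInfo:{...}"); });

  // Command thread: "cont" is issued right away, producing the 'c' packet.
  // Which packet reaches the wire first depends purely on scheduling, so a
  // replay that runs commands as fast as possible can invert the order that
  // was recorded during capture.
  conn.SendPacket("c");

  event_handler.join();
  return 0;
}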

I haven't bothered too much with naming. The same logic should be applicable to other situations that require synchronization.

Diff Detail

Event Timeline

JDevlieghere created this revision.Jul 8 2020, 6:25 PM

Well, this is an interesting problem... IIUC, what's happening is that the printing of the "stop reason" in the "event handler thread" is generating some packets (to fetch the stop reason), and then the "cont" command produces some (c) too...

I'm wondering if this problem does not go beyond reproducers... The fact that we can have two threads running in parallel, doing stuff, without much synchronization seems like a bad idea in general. In theory, something similar could happen during normal operation as well, if the user was fast enough to type "cont<Return>" before the stop reason is printed. In the non-reproducer mode we wouldn't crash with the unexpected packet assertion, but we would probably do something unexpected nonetheless. Of course, a real human user won't be fast enough to do that, but I wouldn't rule out this being the cause of the flakiness of some of our pexpect tests. And I'm not sure what happens when sourcing command files, which is fairly similar to running a reproducer. How would the events be processed there?

So, I'm wondering if we shouldn't add some form of synchronization here, which would be independent of the reproducer mode we're running in. That would solve the reproducer problem as a consequence, but it seems like it would be generally useful. One way to do that could be to send some sort of a ping event (EventDataReceipt ?) to the event handler thread before or after running a command, and wait for it to report back. That should ensure that all events produced by the previous command have been fully processed...
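
Very roughly, the ping could look something like this (a sketch only; EventQueue, DrainEventHandler and the marker event are made-up names, not existing LLDB API). Because the default event handler drains its listener in FIFO order, once the marker event is handled, every event posted before it must have been handled too:

#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

// Hypothetical event queue standing in for the default event handler's
// listener.
class EventQueue {
public:
  void Post(std::function<void()> event) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      events_.push_back(std::move(event));
    }
    cv_.notify_one();
  }

  // Runs on the default event handler thread; events are handled strictly
  // in FIFO order.
  void Run() {
    for (;;) {
      std::function<void()> event;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !events_.empty(); });
        event = std::move(events_.front());
        events_.pop_front();
      }
      if (!event)
        return; // a null event terminates the handler thread
      event();
    }
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> events_;
};

// The "ping": post a marker event and block until the event handler thread
// reaches it. FIFO ordering then guarantees that every event posted before
// the marker (e.g. the process-stopped event) has been fully handled.
void DrainEventHandler(EventQueue &queue) {
  std::mutex m;
  std::condition_variable cv;
  bool handled = false;
  queue.Post([&] {
    std::lock_guard<std::mutex> lock(m);
    handled = true;
    cv.notify_one();
  });
  std::unique_lock<std::mutex> lock(m);
  cv.wait(lock, [&] { return handled; });
}

The command interpreter would then call DrainEventHandler after each command returns.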

Well, this is an interesting problem... IIUC, what's happening is that the printing of the "stop reason" in the "event handler thread" is generating some packets (to fetch the stop reason), and then the "cont" command produces some (c) too...

Yup

I'm wondering if this problem does not go beyond reproducers... The fact that we can have two threads running in parallel, doing stuff, without much synchronization seems like a bad idea in general. In theory, something similar could happen during normal operation as well, if the user was fast enough to type "cont<Return>" before the stop reason is printed. In the non-reproducer mode we wouldn't crash with the unexpected packet assertion, but we would probably do something unexpected nonetheless. Of course, a real human user won't be fast enough to do that, but I wouldn't rule out this being the cause of the flakiness of some of our pexpect tests. And I'm not sure what happens when sourcing command files, which is fairly similar to running a reproducer. How would the events be processed there?

Agreed. I believe at some point (long before I started working on LLDB) all the GDB communication was limited to a single thread, but the overhead of having to talk and synchronize with that thread was too costly and it got relaxed.

So, I'm wondering if we shouldn't add some form of synchronization here, which would be independent of the reproducer mode we're running in. That would solve the reproducer problem as a consequence, but it seems like it would be generally useful. One way to do that could be to send some sort of a ping event (EventDataReceipt ?) to the event handler thread before or after running a command, and wait for it to report back. That should ensure that all events produced by the previous command have been fully processed...

How would this be different from synchronous mode and how would you make this work in asynchronous mode?

As an aside: the first thing I considered here was making this particular command synchronous by blocking until we get the stop event. During replay the execution is always synchronous (which is something I don't like and was hoping to get rid of with this patch). That would only solve the problem for this command and adds a bunch of assumptions about connections resulting in stops. I know this is always true for gdb platforms, but maybe not for other platforms. Also, currently the debugger is the only one that deals with these events; I'm not super excited to have that logic spread out, with some commands handling it synchronously and others doing it asynchronously.

Based on your description I don't yet see how the EventDataReceipt thingy would work. If we block from the moment the event is posted, then what's the goal of having the event handling logic running in a separate thread? If we block from when the event starts being handled until it finishes, then how would you avoid a race between when the event is broadcast and when it starts being handled? Can you elaborate a bit on this?

I think the proper way to gate this sort of "race" is to use the process state. If the current process state is "stopped" then that means we're done with the stop processing and ready to allow commands that require the process to be stopped.

When I originally did this, the plan was that the Public state would not be set to stopped until the stop event was consumed by the process listener. So if you tried to "continue" the process before the work to handle the stop event was done, you would see the state as "running" and the continue would not succeed.

However, that ran up against the problem that part of the "asynchronous" work that you might do on stop can involve running commands and the SB API. Those need to see a public stop state or they won't work correctly. So the public state gets set to stopped too early, and then it's possible for this to be racy. IMO the solution to this problem should be for the Event Handler to be able to run its commands "as if" the public state were stopped, but everybody else who is trying to run commands will still see it as running. Then when the event handler is done with its job, the process Public state is switched to stopped, and now commands coming from other sources will run in the stopped state.
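
As a loose sketch of that idea (entirely hypothetical; this is not how Process is structured today), the process would keep two views of the state: the event handler's commands query a view that already reports stopped, while everyone else keeps seeing running until the handler finishes:

#include <atomic>

enum State { eStateRunning, eStateStopped };

// Hypothetical process wrapper: the event handler gets a view that already
// sees the stop, while every other command source keeps seeing "running"
// until the handler has finished its work.
class ProcessState {
public:
  // Called when the stop event is created (private state thread).
  void OnPrivateStop() { private_stopped_ = true; }

  // Called by the event handler once it has finished handling the stop.
  void FinishStopProcessing() { public_stopped_ = true; }

  // What commands run *by the event handler* see: already stopped, so
  // backtraces, stop hooks, etc. work as expected.
  State GetStateForEventHandler() const {
    return private_stopped_ ? eStateStopped : eStateRunning;
  }

  // What commands from any other source (command line, SB API) see: still
  // running, so a premature "continue" would refuse to run.
  State GetPublicState() const {
    return public_stopped_ ? eStateStopped : eStateRunning;
  }

private:
  std::atomic<bool> private_stopped_{false};
  std::atomic<bool> public_stopped_{false};
};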

I think the proper way to gate this sort of "race" is to use the process state. If the current process state is "stopped" then that means we're done with the stop processing and ready to allow commands that require the process to be stopped.

When I originally did this, the plan was that the Public state would not be set to stopped until the stop event was consumed by the process listener. So if you tried to "continue" the process before the work to handle the stop event was done, you would see the state as "running" and the continue would not succeed.

However, that ran up against the problem that part of the "asynchronous" work that you might do on stop can involve running commands and the SB API. Those need to see a public stop state or they won't work correctly. So the public state gets set to stopped too early, and then it's possible for this to be racy. IMO the solution to this problem should be for the Event Handler to be able to run its commands "as if" the public state were stopped, but everybody else who is trying to run commands will still see it as running. Then when the event handler is done with its job, the process Public state is switched to stopped, and now commands coming from other sources will run in the stopped state.

How would that work with the reproducer issue? How would you prevent the command interpreter from processing the next command before the event handler thread issues the process info packet?

labath added a comment.Jul 9 2020, 1:11 PM

I'm wondering if this problem does not go beyond reproducers... The fact that we can have two threads running in parallel, doing stuff, without much synchronization seems like a bad idea in general. In theory, something similar could happen during normal operation as well, if the user was fast enough to type "cont<Return>" before the stop reason is printed. In the non-reproducer mode we wouldn't crash with the unexpected packet assertion, but we would probably do something unexpected nonetheless. Of course, a real human user won't be fast enough to do that, but I wouldn't rule out this being the cause of the flakiness of some of our pexpect tests. And I'm not sure what happens when sourcing command files, which is fairly similar to running a reproducer. How would the events be processed there?

Agreed. I believe at some point (long before I started working on LLDB) all the GDB communication was limited to a single thread, but the overhead of having to talk and synchronize with that thread was too costly and it got relaxed.

I don't believe that is relevant here. We still have a gdb lock, which ensures that at most one thread is communicating with the server at any given time, even though it's not always the same thread. But that lock is far too granular -- it works at the packet level. What we need here is a way to ensure that the process of fetching the stop reason (as a whole) does not interleave with the subsequent command.

So, I'm wondering if we shouldn't add some form of synchronization here, which would be independent of the reproducer mode we're running in. That would solve the reproducer problem as a consequence, but it seems like it would be generally useful. One way to do that could be to send some sort of a ping event (EventDataReceipt ?) to the event handler thread before or after running a command, and wait for it to report back. That should ensure that all events produced by the previous command have been fully processed...

How would this be different from synchronous mode and how would you make this work in asynchronous mode?

The way I see it, this is mostly orthogonal to the (a)synchronous mode issue. I consider the problem to be in the gdb-remote command, which behaves the same way in both modes. It causes an eBroadcastBitStateChanged/eStateStopped event to be sent, and in response to that the default event handler prints the stop reason string. Normally this happens very quickly. So quickly, in fact, that one would be forgiven for considering the stop reason string to be the "output" of the gdb-remote command. My proposal is to not consider this command to be "complete" until that event is processed, essentially making that string _be_ the output of the command.

For the asynchronous mode, I do think that we will need some form of a global "clock". I'm just not sure that this is the right form for it. To me it seems that this is mostly about the relationship of the command interpreter and the gdb-remote protocol -- if I do an asynchronous "process continue", followed (10 seconds later) by a "process interrupt", the gdb replayer needs to know that it should not respond to the $c packet, but wait for the ^C interrupt packet which will come from the interrupt command.

Maybe you could try explaining how you wanted to make this work for asynchronous mode. (Btw, I think the example above may not even need a global clock -- the gdb replayer can just note that the $c packet should not be replied to, and then send the stop-reply as a response to the ^C packet.)

As an aside: the first thing I considered here was making this particular command synchronous by blocking until we get the stop event.

Now, here I get a feeling that one of us is misunderstanding the (a)synchronous concept. The way I remember this is that when we want to do a synchronous command, we wait for the Stopped event by injecting a custom listener between the source and the final listener. This custom listener fetches the event, does something, and then broadcasts it to the final listener. This ensures that the process is stopped and that most of the work that should happen when it stops has happened, but it does _not_ synchronize with the processing done by the final listener (which is what we want here).
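
For reference, a stripped-down sketch of that interposed-listener pattern (hypothetical types; LLDB's real listener/broadcaster machinery is more involved). The synchronous command waits on its own listener for the stopped event; what it then does with the event, rebroadcast or consume, is exactly the question discussed later in this thread:

#include <condition_variable>
#include <mutex>
#include <queue>

enum StateType { eStateRunning, eStateStopped };

// Hypothetical listener that a synchronous command interposes between the
// event source (the process) and the final/public listener.
class InterposedListener {
public:
  void Broadcast(StateType state) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      events_.push(state);
    }
    cv_.notify_one();
  }

  StateType WaitForStateChangedEvent() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return !events_.empty(); });
    StateType state = events_.front();
    events_.pop();
    return state;
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<StateType> events_;
};

// The synchronous command blocks here until the stop arrives. Whether it then
// rebroadcasts the event to the real public listener or consumes it outright
// is the crux of the "process continue" discussion below.
void WaitForStop(InterposedListener &listener) {
  while (listener.WaitForStateChangedEvent() != eStateStopped) {
  }
}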

During replay the execution is always synchronous (which is something I don't like and was hoping to get rid of with this patch). That would only solve the problem for this command and adds a bunch of assumptions about connections resulting in stops. I know this is always true for gdb platforms, but maybe not for other platforms.

Actually, I wanted this to be fully general -- after the execution of _any_ command, we make sure to drain _all_ events from the default event handler queue. This handling can still be done in the Debugger object; all the command interpreter would need to do is call debugger->DrainEventHandler() at an appropriate time.

Also, currently the debugger is the only one that deals with these events; I'm not super excited to have that logic spread out, with some commands handling it synchronously and others doing it asynchronously.

Based on your description I don't yet see how the EventDataReceipt thingy would work. If we block from the moment the event is posted, then what's the goal of having the event handling logic running in a separate thread? If we block from when the event starts being handled until it finishes, then how would you avoid a race between when the event is broadcast and when it starts being handled? Can you elaborate a bit on this?

For this particular case, I don't think there's an advantage (and there certainly are disadvantages) to doing this in a separate thread. IIUC (I also wasn't present at that time), this event handler thread exists so that we can respond to changes done through the SB API. So, if something does the SB equivalent of "gdb-remote" or "frame select" or whatever, we can still print a message to the console that the state of the process has changed. Reusing this mechanism for the command line kinda makes sense, since we want things to also work the other way around -- have the IDE respond to process changes which are being done through the command line.

The way I was imagining this working is that the DrainEventHandler function would send a special event to the default event handler thread, and wait until it gets processed. Since the events are processed in FIFO order, this would ensure that all preceding events have been processed as well. Note that this would go to the default event handler even if the process events happen to be going to a different listener (e.g. one provided by the IDE). In that case the default event handler would just return straight away -- which is fine, because in this case the stop reason would not get printed.
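
On the caller side this would be little more than the following (reusing the EventQueue/DrainEventHandler sketch from my earlier comment; CommandSource is a made-up stand-in):

#include <string>
#include <vector>

// Made-up stand-in for the command source (interactive prompt, command file,
// or reproducer replay). HandleCommand runs one command and may broadcast
// process events toward the default event handler.
struct CommandSource {
  std::vector<std::string> commands;
  void HandleCommand(const std::string &command) { (void)command; }
};

// After each command, drain the default event handler before running the next
// one, so e.g. a replayed "cont" cannot overtake the stop-reason printing
// triggered by the previous command.
void RunCommands(CommandSource &source, EventQueue &event_queue) {
  for (const std::string &command : source.commands) {
    source.HandleCommand(command);
    DrainEventHandler(event_queue); // FIFO order => earlier events are done
  }
}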

That said, the talk of SB API has reminded me that a similar race can exist between the SB commands and the event handler thread. Given that the SB replayer also streams SB commands as fast as it can, the same race can occur if the user does the SB equivalents of the above command line commands (this is a problem with both solutions).

Now obviously, we don't want to drain the event queue after every SB call. I'm not exactly sure what that means though...

Agreed. I believe at some point (long before I started working on LLDB) all the GDB communication was limited to a single thread, but the overhead of having to talk and synchronize with that thread was too costly and it got relaxed.

I don't believe that is relevant here. We still have a gdb lock, which ensures that at most one thread is communicating with the server at any given time, even though it's not always the same thread. But that lock is far too granular -- it works at the packet level. What we need here is a way to ensure that the process of fetching the stop reason (as a whole) does not interleave with the subsequent command.

It is relevant for the reproducer and the unexpected packet. If there was one thread dealing with these events, we could do the synchronization there. The problem is that at the server side it's already too late. Anyway, this is not something that I want to pursue :-)

How would this be different from synchronous mode and how would you make this work in asynchronous mode?

The way I see it, this is mostly orthogonal to the (a)synchronous mode issue. I consider the problem to be in the gdb-remote command, which behaves the same way in both modes. It causes an eBroadcastBitStateChanged/eStateStopped event to be sent, and in response to that the default event handler prints the stop reason string. Normally this happens very quickly. So quickly, in fact, that one would be forgiven for considering the stop reason string to be the "output" of the gdb-remote command. My proposal is to not consider this command to be "complete" until that event is processed, essentially making that string _be_ the output of the command.

That's the approach I described below. Blocking on the stop event seems at odds with the asynchronous mode. Does a connect result in a stop for every platform we support?

For the asynchronous mode, I do think that we will need some form of a global "clock". I'm just not sure that this is the right form for it. To me it seems that this is mostly about the relationship of the command interpreter and the gdb-remote protocol -- if I do an asynchronous "process continue", followed (10 seconds later) by a "process interrupt", the gdb replayer needs to know that it should not respond to the $c packet, but wait for the ^C interrupt packet which will come from the interrupt command.

Maybe you could try explaining how you wanted to make this work for asynchronous mode.

Although I want replay to behave as close to capture as possible, I'm fine with continuing to replay in synchronous mode.

(Btw, I think the example above may not even need a global clock -- the gdb replayer can just note that the $c packet should not be replied to, and then send the stop-reply as a response to the ^C packet.)

That's already how it works today. For the reproducers the (a)synchronous mode is all about the command interpreter and when it issues the next command.

As an aside: the first thing I considered here was making this particular command synchronous by blocking until we get the stop event.

Now, here I get a feeling that one of us is misunderstanding the (a)synchronous concept. The way I remember this is that when we want to do a synchronous command, we wait for the Stopped event by injecting a custom listener between the source and the final listener. This custom listener fetches the event, does something, and then broadcasts it to the final listener. This ensures that the process is stopped and that most of the work that should happen when it stops has happened, but it does _not_ synchronize with the processing done by the final listener (which is what we want here).

During replay the execution is always synchronous (which is something I don't like and was hoping to get rid of with this patch). That would only solve the problem for this command and adds a bunch of assumptions about connections resulting in stops. I know this is always true for gdb platforms, but maybe not for other platforms.

Actually, I wanted this to be fully general -- after the execution of _any_ command, we make sure to drain _all_ events from the default event handler queue. This handling can still be done in the Debugger object; all the command interpreter would need to do is call debugger->DrainEventHandler() at an appropriate time.

Also, currently the debugger is the only one that deals with these events; I'm not super excited to have that logic spread out, with some commands handling it synchronously and others doing it asynchronously.

Based on your description I don't yet see how the EventDataReceipt thingy would work. If we block from the moment the event is posted, then what's the goal of having the event handling logic running in a separate thread? If we block from when the event starts being handled until it finishes, then how would you avoid a race between when the event is broadcast and when it starts being handled? Can you elaborate a bit on this?

For this particular case, I don't think there's an advantage (and there certainly are disadvantages) to doing this in a separate thread. IIUC (I also wasn't present at that time), this event handler thread exists so that we can respond to changes done through the SB API. So, if something does the SB equivalent of "gdb-remote" or "frame select" or whatever, we can still print a message to the console that the state of the process has changed. Reusing this mechanism for the command line kinda makes sense, since we want things to also work the other way around -- have the IDE respond to process changes which are being done through the command line.

The way I was imagining this working is that the DrainEventHandler function would send a special event to the default event handler thread, and wait until it gets processed. Since the events are processed in FIFO order, this would ensure that all preceding events have been processed as well. Note that this would go to the default event handler even if the process events happen to be going to a different listener (e.g. one provided by the IDE). In that case the default event handler would just return straight away -- which is fine, because in this case the stop reason would not get printed.

That said, the talk of SB API has reminded me that a similar race can exist between the SB commands and the event handler thread. Given that the SB replayer also streams SB commands as fast as it can, the same race can occur if the user does the SB equivalents of the above command line commands (this is a problem with both solutions).

I actually designed this solution with that in mind. I imagine having a synchronizer for every command interpreter and a different one for the API replay. Why would the current patch not work for that?

Now obviously, we don't want to drain the event queue after every SB call. I'm not exactly sure what that means though...

I think the proper way to gate this sort of "race" is to use the process state. If the current process state is "stopped" then that means we're done with the stop processing and ready to allow commands that require the process to be stopped.

When I originally did this, the plan was that the Public state would not be set to stopped until the stop event was consumed by the process listener. So if you tried to "continue" the process before the work to handle the stop event was done, you would see the state as "running" and the continue would not succeed.

However, that ran up against the problem that part of the "asynchronous" work that you might do on stop can involve running commands and the SB API. Those need to see a public stop state or they won't work correctly. So the public state gets set to stopped too early, and then it's possible for this to be racy. IMO the solution to this problem should be for the Event Handler to be able to run its commands "as if" the public state were stopped, but everybody else who is trying to run commands will still see it as running. Then when the event handler is done with its job, the process Public state is switched to stopped, and now commands coming from other sources will run in the stopped state.

How would that work with the reproducer issue? How would you prevent the command interpreter from processing the next command before the event handler thread issues the process info packet?

Sorry, I think these issues are orthogonal. I was mostly responding to Pavel's comments.

Seems like my thoughts got lost a bit with all the inline replies: we can solve this particular issue by making process connect block in synchronous mode. The fact that that's not happening today is a bug beyond the reproducers. I don't think we should change the current behavior in asynchronous mode. I think it's probably worthwhile to do that even if we decide to add extra synchronization for the reproducers, in which case this would no longer be an example of a problem solved by the current patch.

I want to be careful not to conflate lldb's general event behavior with the way the command interpreter happens to run things.

For instance, there's no requirement that the debugger be the event listener for processes. You can have a listener per process under the same debugger. Again, I think the synchronization between what the command interpreter is allowed to do and what's going on in individual processes is best governed by whether each individual process is stopped or not...

Seems like my thoughts got lost a bit with all the inline replies: we can solve this particular issue by making process connect block in synchronous mode. The fact that that's not happening today is a bug beyond the reproducers. I don't think we should change the current behavior in asynchronous mode. I think it's probably worthwhile to do that even if we decide to add extra synchronization for the reproducers, in which case this would no longer be an example of a problem solved by the current patch.

I'm starting to understand why you say "process connect" is not blocking/synchronous. Although it is waiting for the "stopped" event to be sent to the public queue (here), it is not waiting for it to be processed by the public queue. The "process continue" command, OTOH, is kind of waiting for the public queue processing to finish. I say only "kind of" because it does not actually process it on the _real_ public queue -- instead it "hijacks" the events and processes them on its own little event loop. The part that I got wrong was that I was assuming that the hijacked event loop would rebroadcast the stop event to the real public queue. In fact, it doesn't, which means that the stop event processing is completely finished by the time that the "process continue" command returns.

It seems reasonable that "process connect" (and its SB equivalent) should behave the same way. This would also explain why so many of the "gdb-client" tests need to use lldbutil.expect_state_changes when connecting to the mock server, even though they are running in synchronous mode (the only other tests using this function explicitly choose to use the async mode).

For async mode, we can say that the user needs to ensure that the event is handled before issuing other commands -- this is the same situation as with "process continue". Of course, in the async case, the reproducers are "the user" and they will need to do something about this. But maybe we don't need to cross that bridge right now?

Seems like my thoughts got lost a bit with all the inline replies: we can solve this particular issue by making process connect block in synchronous mode. The fact that that's not happening today is a bug beyond the reproducers. I don't think we should change the current behavior in asynchronous mode. I think it's probably worthwhile to do that even if we decide to add extra synchronization for the reproducers, in which case this would no longer be an example of a problem solved by the current patch.

I'm starting to understand why you say "process connect" is not blocking/synchronous. Although it is waiting for the "stopped" event to be sent to the public queue (here), it is not waiting for it to be processed by the public queue. The "process continue" command, OTOH, is kind of waiting for the public queue processing to finish. I say only "kind of" because it does not actually process it on the _real_ public queue -- instead it "hijacks" the events and processes them on its own little event loop. The part that I got wrong was that I was assuming that the hijacked event loop would rebroadcast the stop event to the real public queue. In fact, it doesn't, which means that the stop event processing is completely finished by the time that the "process continue" command returns.

It seems reasonable that "process connect" (and its SB equivalent) should behave the same way. This would also explain why so many of the "gdb-client" tests need to use lldbutil.expect_state_changes when connecting to the mock server, even though they are running in synchronous mode (the only other tests using this function explicitly choose to use the async mode).

For async mode, we can say that the user needs to ensure that the event is handled before issuing other commands -- this is the same situation as with "process continue". Of course, in the async case, the reproducers are "the user" and they will need to do something about this. But maybe we don't need to cross that bridge right now?

The "synchronous" mode contract is that by the time the command completes, there should be no more event work to be done. That's why the event should be consumed and not rebroadcast. It seems to me "process connect" is just breaking the synch mode contract.

I think truly async mode (particularly where you can have multiple processes, each with a different listener) is going to be challenging for reproducers. But sync mode shouldn't be that hard, provided it is implemented correctly everywhere.

JDevlieghere abandoned this revision.Jan 6 2022, 8:53 AM