This is an archive of the discontinued LLVM Phabricator instance.

Add an "interrupt timeout" to Process, fix a bug in handling interrupt timeouts in
ClosedPublic

Authored by jingham on May 7 2021, 11:26 AM.

Details

Summary

This patch started as an attempt to rationalize the handling of the timeout for interrupting a running process. Process::Halt sends a 10 second timeout to WaitForProcessToStop, but that timeout didn't have any effect on how the interrupt was actually handled in ProcessGDBRemote & its GDBRemoteClient client, and there was no way to plumb the desired timeout to that layer.

This patch adds an interrupt-timeout setting to process and plumbs that through.

That then allowed me to write some more tests on the interrupt handling, in the course of which I found a bug in the handling of the interrupt timeout.

The lowest level waiting on the remote stub doesn't block when the process is running. Instead, it passes in a 5 second wakeup timeout when it calls ReadPacket. That's there so we can detect if the connection has dropped, but in the current code it is also the de facto interrupt timeout as well.

When ProcessGDBRemote wants to interrupt a running process, it sends a break to debugserver, and sets some ivars that say "interrupt is in flight" and to provide the "interrupt timepoint" to the client.

When the GDBRemoteClientBase object wakes up from ReadPacket because of the timeout, it checks whether there was an interrupt in flight, and if so, next checks whether 5 seconds have elapsed since the interrupt request. If that latter test is NOT true, it is supposed to go back into ReadPacket to wait out the remaining time (*). But it flubs this part. Instead of calling "continue" to go back to top of the ReadPacket loop, it just breaks from the "result checking" switch. The next thing after the switch is a check to see if the response packet is empty, and if it is empty, we exit from the ReadPacket loop, returning eStateInvalid - the same state as if the interrupt had failed.

But if we woke up before the interrupt had a chance to take effect, there won't be any response yet, and so we error out. That looks counter to the intention of the code that was checking the timeouts. It's bad because it means that if you are unlucky and happened to request the interrupt just before the ReadPacket call exceeds the timeout, you could in fact end up waiting only a very short time for the interrupt.

I have lots of reports of this happening, but never reproducibly. The reports always come in contexts where lots of processes were being started and stopped quickly. For instance, Xcode runs all its test plans in the debugger when not running in a test harness, and runs the tests in parallel. So it was starting & stopping a lot of processes in parallel, and that this should occasionally cause the system to stall didn't seem altogether unreasonable. That was the motivation for adding the interrupt timeout in the first place. That still may be true, but it looks like this bug was making the chance of exceeding the timeout more likely.

I still think having the timeout is valuable, however, if for no other reason than that was how I was able to write a test for the behavior that triggered this bug...

(*) BTW, this is not an exact timer. We don't have a way to stop ReadPacket & readjust the wakeup time if the interrupt time happens to be shorter. And when ReadPacket wakes up and finds we have an interrupt pending, but haven't waited long enough, formally we should change our timeout to be the amount left to wait for the interrupt. I didn't change this behavior as I don't think we need to guarantee that kind of exactness. As long as we wait at least as long as the timeout requested, I think we're doing what folks want, and I don't think being more precise is worth the added complexity.

Diff Detail

Event Timeline

jingham requested review of this revision.May 7 2021, 11:26 AM
jingham created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptMay 7 2021, 11:26 AM
clayborg added inline comments.May 7 2021, 12:46 PM
lldb/source/Plugins/Process/gdb-remote/GDBRemoteClientBase.cpp
353

if "interrupt_timeout" is optional, it can just be passed along to SyncWithContinueThread(interrupt_timeout) and avoid the need for m_should_interrupt.

361

It we pass take a "llvm::Optional<std::chrono::seconds> interrupt_timeout = llvm::None" as a parameter to this, we can avoid the need for m_should_interrupt to exist and can test if it has no value

lldb/source/Plugins/Process/gdb-remote/GDBRemoteClientBase.h
36

There are a lot of interrupt_timeout parameters in many methods here. Should we pass the interrupt timeout to the constructor and keep it as a member variable?

52

This function and the one above could be made into one function if we change the interrupt timeout to be defined as:

llvm::Optional<std::chrono::seconds> interrupt_timeout = llvm::None
56–59

Is this function only used when sending async now?

72

Combine with above function and change to optional?:

llvm::Optional<std::chrono::seconds> interrupt_timeout = llvm::None
86

Can't we just set this and update it if the user changes the setting and avoid passing "std::chrono::seconds interrupt_timeout" to many of the member functions in this class and switch any such parameters to a bool?

lldb/source/Plugins/Process/gdb-remote/GDBRemoteCommunicationClient.h
325

We have a ton of new "std::chrono::seconds interrupt_timeout" parameters. Since GDBRemoteClientBase already has an ivar for this, can't we just use this and pass a bool or no param instead?

514–516

We have a ton of new "std::chrono::seconds interrupt_timeout" parameters. Since GDBRemoteClientBase already has an ivar for this, can't we just use this and pass a bool or no param instead?

517–519

Ditto

lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp
1213

We should just set the interrupt timeout when we create the m_gdb_comm and then we won't have to pass this in to each call? We just need to make sure we update the timeout if it gets changed during a debug session, but that isn't hard.

1217

ditto and for many calls below

I don't see the change as "adding a bunch of timeouts", since what it mostly did was remove a bunch of send_async = false's. For all the clients that don't care about a timeout, I removed the send_async parameter, and for those that did I replaced it with a timeout. There were a few cases at the GDBRemoteClientBase layer where routines were calling the SendPacket routines with send_async = true w/o requesting that info from ProcessGDBRemote. To those I added a timeout parameter. That I consider a plus since it made explicit which packet request functions were interrupting and which were not.

Given the packet requests that called with send_async = false made up ~95% of all uses, it seemed like marking the few uses that did want to interrupt by the fact that they explicitly receive a timeout parameter was cleaner than switching everybody over to a Timeout, and then having 95% of the clients still have to make up an empty Timeout.

Instead, the rule is now "client functions that are going to interrupt the target get passed a timeout, and otherwise you don't have to worry about it". That removed a bunch of boiler-plate and seems clear to me.

We could go further, as you suggest, and add an m_interrupt_timeout to GDBRemoteClientBase (*). That would avoid ever having to pass a timeout into the SendPacket client functions in GDBRemoteClientBase. We already have ValueChanged callbacks for Properties. We'd have to add a virtual "InterruptTimeoutChanged" to Process to do the right thing for ProcessGDBRemote. None of that would be hard to do.

In fact, I actually thought about doing that, but I didn't because I like the fact that from ProcessGDBRemote, you are able to tell which methods in GDBRemoteClientBase interrupt a running process, and which don't by whether they take a timeout or not. If the timeout were in the GDBRemoteClientBase class you wouldn't see that. But that does seems a thing worth knowing. We could preserve that knowledge by adding some naming convention to distinguish between interrupting & non-interrupting methods. But I don't see what that really gains us.

OTOH, if the general consensus is that is isn't important to know whether an information requesting function in GDBRemoteClientBase is interrupting or not, I can certainly add the variable and get it to track the setting changes in Process & ProcessGDBRemote.

(*) We would have to add it, the one that's already there belongs to the GDBRemoteClientBase::Lock class, which is short-lived, not to the GDBRemoteClientBase per se.

lldb/source/Plugins/Process/gdb-remote/GDBRemoteCommunicationClient.h
514–516

Note, m_interrupt_timeout is not an ivar of GDBRemoteClientBase, it's an ivar of GDBRemoteClientBase::Lock.

I don't see the change as "adding a bunch of timeouts", since what it mostly did was remove a bunch of send_async = false's. For all the clients that don't care about a timeout, I removed the send_async parameter, and for those that did I replaced it with a timeout. There were a few cases at the GDBRemoteClientBase layer where routines were calling the SendPacket routines with send_async = true w/o requesting that info from ProcessGDBRemote. To those I added a timeout parameter. That I consider a plus since it made explicit which packet request functions were interrupting and which were not.

Given the packet requests that called with send_async = false made up ~95% of all uses, it seemed like marking the few uses that did want to interrupt by the fact that they explicitly receive a timeout parameter was cleaner than switching everybody over to a Timeout, and then having 95% of the clients still have to make up an empty Timeout.

Instead, the rule is now "client functions that are going to interrupt the target get passed a timeout, and otherwise you don't have to worry about it". That removed a bunch of boiler-plate and seems clear to me.

We could go further, as you suggest, and add an m_interrupt_timeout to GDBRemoteClientBase (*). That would avoid ever having to pass a timeout into the SendPacket client functions in GDBRemoteClientBase. We already have ValueChanged callbacks for Properties. We'd have to add a virtual "InterruptTimeoutChanged" to Process to do the right thing for ProcessGDBRemote. None of that would be hard to do.

In fact, I actually thought about doing that, but I didn't because I like the fact that from ProcessGDBRemote, you are able to tell which methods in GDBRemoteClientBase interrupt a running process, and which don't by whether they take a timeout or not. If the timeout were in the GDBRemoteClientBase class you wouldn't see that. But that does seems a thing worth knowing. We could preserve that knowledge by adding some naming convention to distinguish between interrupting & non-interrupting methods. But I don't see what that really gains us.

I just note when I see a ton of changes to things that can easily be fixed by adding an ivar I tend to do it. It is fine that the "bool async" parameter went away.

We should probably combine the two functions that only differ by one taking a timeout and the other not. They are very similar and can easily use a Optional<> on the timeout parameter.

OTOH, if the general consensus is that is isn't important to know whether an information requesting function in GDBRemoteClientBase is interrupting or not, I can certainly add the variable and get it to track the setting changes in Process & ProcessGDBRemote.

(*) We would have to add it, the one that's already there belongs to the GDBRemoteClientBase::Lock class, which is short-lived, not to the GDBRemoteClientBase per se.

jingham updated this revision to Diff 343785.May 7 2021, 5:01 PM

I coalesced the timeout & no-timeout versions of GDBRemoteClientBase::SendPacketAndWaitForResponse and GDBRemoteClientBase::Lock::Lock as you suggested. I didn't use an Optional. That seems overkill when there's a perfectly good sentinel value for "don't interrupt": timeout of 0 seconds. So I just made the timeout in these two cases have a default value of 0 seconds, and used that to mean no interrupt. That way passing no timeout means "don't interrupt" and passing a non-zero one means interrupt.

BTW, I generally agree that passing values in parameters that are always going to be used internally anyway is ugly and it's better to store them in an ivar. But in this case, the value was going to get used in some cases and not in others, which storing it as a parameter would have hidden. That's what I was trying to avoid.

shafik added a subscriber: shafik.May 10 2021, 10:38 AM
shafik added inline comments.
lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp
3129

/*insert=*/true

3169

/*insert=*/true

Thanks for combining the functions and fine not to use Optional is we have a good value to indicate no timeout.

Just one inline comment where we can hopefully get rid of kWakeupInterval being a constant.

lldb/source/Plugins/Process/gdb-remote/GDBRemoteClientBase.cpp
56

Should we tailor this timeout a bit better to be related to the time left until we reach m_interrupt_endpoint? If we set the interrupt timeout to 1 second, we will wait for 5 seconds before we return. Maybe this should be calculated by subtracting m_interrupt_endpoint from steady_clock::now()?

jingham updated this revision to Diff 344213.May 10 2021, 2:55 PM

When the ReadPacket call in SendContinuePacketAndWaitForResponse wakes up due to the wakeup timeout, and there's an interrupt which hasn't reached it's endpoint, shorten the wakeup interval so we don't wait longer than the interrupt endpoint.

jingham marked an inline comment as done.May 10 2021, 2:57 PM

Was that what you had in mind, Greg?

clayborg accepted this revision.May 10 2021, 3:56 PM

Was that what you had in mind, Greg?

Yes! Thanks for all the changes. LGTM

This revision is now accepted and ready to land.May 10 2021, 3:56 PM
This revision was landed with ongoing or failed builds.May 11 2021, 11:57 AM
This revision was automatically updated to reflect the committed changes.