This is an archive of the discontinued LLVM Phabricator instance.

Extending step-over range past calls was causing deadlocks, fix that.
ClosedPublic

Authored by jingham on Dec 12 2019, 4:13 PM.

Details

Summary

This change:

https://reviews.llvm.org/D58678

made lldb extend its "next branch breakpoint" over call (IsCall) instructions as a way to speed up stepping. However, the change wasn't quite complete. When lldb runs to the "next branch breakpoint" it always stops the other threads. That is because, as much as possible, we want users to be able to focus on the thread they are debugging, and not have events from other threads disrupt their attention.

It was safe to do this for code in a function because, except in very odd circumstances, the actions that might deadlock are always function calls.

Before this change we would run to the call, step into the function, then step out. And we always ran all threads when stepping out of a function call.

With the change, we were extending the range past the call, but treating it as "code in the function", which it no longer was. That meant we were running arbitrary code with only one thread running, which causes deadlocks in the target, and then the "step-over" would never complete.

This patch adds a parameter to GetIndexOfNextBranchInstruction indicating whether we extended the range past a call. Then in ThreadPlanStepOverRange, we check this parameter and run one thread or all threads depending on whether we extended past a call.

I also added a test case for this behavior. It is a little unfortunate that a failure of this test will really be a timeout, since if we get it wrong the step-over never returns. But I can't think of a better way of doing it. I don't want to add a watchdog timer because then I'd get spurious failures on slow machines or fight against the overall timeouts...

Diff Detail

Event Timeline

jingham created this revision.Dec 12 2019, 4:13 PM
clayborg accepted this revision.Dec 12 2019, 7:26 PM

That must have been fun to find! LGTM.

This revision is now accepted and ready to land.Dec 12 2019, 7:26 PM
labath added a subscriber: labath.Dec 13 2019, 2:34 AM

This looks like a tricky bug. Thanks for tracking it down.

I have two questions though: :D

  • Could you use C++11 locking primitives in the test? pthreads do not work on Windows.
  • what exactly will be the effect of this? Will we start all threads immediately, or will we do the usual dance of trying to run one thread for a short period of time, and then resume all of them if that times out? If it's the latter, I'm wondering if we really need this call detection logic -- we could theoretically just use the "run all" method all the time (except if explicitly disabled), with the knowledge that all reasonable call-less sequences will finish before the "run single thread" timeout expires. And it would automatically also handle weird call-less sequences with spinlocks or what not...

  • Could you use C++11 locking primitives in the test? pthreads do not work on Windows.

I copied the test from somewhere else, so I'll have to change the original too. Let me give this a go.

  • what exactly will be the effect of this? Will we start all threads immediately, or will we do the usual dance of trying to run one thread for a short period of time, and then resume all of them if that times out? If it's the latter, I'm wondering if we really need this call detection logic -- we could theoretically just use the "run all" method all the time (except if explicitly disabled), with the knowledge that all reasonable call-less sequences will finish before the "run single thread" timeout expires. And it would automatically also handle weird call-less sequences with spinlocks or what not...

Currently, the actual stepping plans don't do "try a bit on one thread, then resume all". They are conservative, and will let all threads run anytime they run "arbitrary" code. It's only running expressions that does the one-and-then-many dance.

The original idea was that the "Process::RunThreadPlan" would be used for all the thread plans that might stall if run on only one thread. But before it got wired up for that it became dedicated to expression evaluation. It's more important to try to stay on one thread for expressions, since the user stopped the process somehow and has a reasonable expectation that it will stay stopped. But as a side effect, I never got around to wiring that dance up to the regular stepping.

However, since I've been playing around with this trick for expression evaluation, I'm a little less sure I want to extend it to regular operation. OTOH, that WOULD fix problems with stepping over call-less spinlocks and, more importantly (since I've never had any reports of that theoretical problem being an actual one), we would get better at keeping stepping operations on one thread.

OTOH, interrupting a process is not cost-free. We have to send a signal to interrupt it, and that can cause EINTRs in places people weren't expecting. This has caused problems in the past, where the interrupt from running an expression would cause a read call to raise an unhandled EINTR. Of course, you should always check for EINTR etc., so these probably are actually bugs in the code, and you'd think people would thank me for surfacing them. But when I point that out, it doesn't seem to mollify the users who have reported the problem.

This is a common enough "mis-pattern" that I am hesitant to add a stepping method that would really spam the inferior with interrupts. I think our current method is the best compromise between trying to keep focus on the active thread, and changing normal flow of execution as little as possible.

jingham updated this revision to Diff 234206.Dec 16 2019, 5:44 PM

Changed the test case from locking.c (which used pthreads directly) to locking.cpp, which uses std::thread & Co.

I also changed the test I cribbed this one from to use std::thread as well. I prospectively removed the Windows expected-fail from the expression test, since it should build now.

This revision was automatically updated to reflect the committed changes.

Thanks for the explanation (and for updating the test). This actually explains why it sometimes seemed that I lost control over other threads while stepping one of them. I'll be sure to use the --run-mode argument next time. This seems fine given how the current stepping logic works, though I personally would expect the other threads to also stay stopped when I do something called "thread step-over". But I get the feeling I expect a different kind of threading behavior from my debugger than most people...

lldb/packages/Python/lldbsuite/test/commands/expression/no-deadlock/TestExprDoesntBlock.py