Fix for bug 34532 - A few rough corners related to post-mortem debugging (core/minidump)
ClosedPublic

Authored by lemo on Sep 8 2017, 4:23 PM.

Details

Summary

The main change is to avoid setting the process state as running when debugging core/minidumps (details in the bug).

Also included a few small, related fixes around how the errors propagate in this case.

Bug: https://bugs.llvm.org/show_bug.cgi?id=34532

Differential Revision: https://reviews.llvm.org/D37651

Diff Detail

lemo created this revision.Sep 8 2017, 4:23 PM
lemo edited the summary of this revision. (Show Details)Sep 8 2017, 4:29 PM
sas requested changes to this revision.Sep 8 2017, 4:35 PM
sas added a subscriber: sas.

I think it would be cleaner/smaller if you didn't use pure-virtual functions, but instead you had CanReseume() return false in Process.h, and had a default implementation of DoResume() that called llvm::report_fatal_error().

Instead, here, you're defaulting CanResume() to true and forcing every child class to redefine as false and provide an error'ing implementation of DoResume().

This revision now requires changes to proceed.Sep 8 2017, 4:35 PM
jingham requested changes to this revision.Sep 8 2017, 4:43 PM
jingham added a subscriber: jingham.

This seems like an awkward solution to me. Having CanResume and DoResume seems like it over-determines the problem.

Why wasn't it enough for DoResume to fail and the error to be propagated?

I'm also uncomfortable that this is making a possible code path for some programmer error to cause a fatal error in lldb (which we really want never to happen). It would be much better to rely on the failure of DoResume.

But whatever the solution, you should never use llvm::report_fatal_error. It's fine to put in an lldb_assert which will assert when the error is hit in the test suite, but not in released lldb's. After all, the worst that will happen if somebody working on lldb were to leave in a code path that called DoResume without CanResume is that the state of that core file target would be wrong. That is not an error worth taking down whoever happened to load the lldb framework into their program, or destroying all the other debug sessions or targets active in lldb.

I think I agree with Jim that it would be better to propagate an error from DoResume than to introduce CanResume. I could imagine situations where DoResume could fail for a reason that CanResume was unable to predict. Having one path for handling failure to resume seems cleaner.

Also, consider adding a test. I think it should be feasible to check the process state after attempting to resume and getting an error.

source/Commands/CommandObjectThread.cpp
778

Yeah, it looks like an oversight that the error was never checked, so this is good.

Make sure to run git clang-format to fix those little formatting nits (like the missing space after if.

lemo added a comment.Sep 8 2017, 6:01 PM

Thanks everyone for the feedback.

I see a common question is why doesn't DoResume() fail-fast itself, or why don't the callers propagate the error. It's a good question, since the knowledge if a subclass can resume or not is almost codified in there. The problem is that the state is changed prior to calling DoResume() - see Process::Resume(), ... m_public_run_lock.TrySetRunning() ..., thus the need for an earlier check.

@sas: Yes, a flip approach is possible and I briefly considered the default CanResume() to return false exactly as you're suggesting. The downside is a more brittle composability - subclasses can't call the base implementation, which is usually a reasonable expectation if one wants to build additional functionality on top of the base one. So I prefer the version I sent out, but I could be persuaded either way.

@jingham: 1. I'm not sure I understand the comment about "over-determining the problem", can you please elaborate? If this is about having two methods instead of just DoResume(), I explained above the rationale. 2. Sorry, but defensive programming is exactly the wrong thing, and the root cause of what I'm fixing here. If the internal state is inconsistent the _only_ valid solution is to fail-fast. I've seen people arguing around this for everything from embedded systems to distributed services and trying to limp along pretending things are not too bad only lead to more pain.

@amccarth: I agree that DoResume() can fail for reasons that CanResume() can't predict, and that's prefectly fine - the error is returned (and hopefully it's propagated correctly). Having DoResume() called for a core or minidump is never ok though, and that's the fail-fast path.

(thanks' for the clang-format tip, I'll do that shortly)

@

lldb is a library used in other programs (Xcode, VSCode etc) and it can support many simultaneous debug sessions and each debug session can support many simultaneous targets. Unless this failure is going to make all the other debug sessions fail, and the other target sessions fail and cause the app which has loaded us to fail, we have no business crashing.

I know there is debate about this one side and another but for lldb this is a settled issue. Unless you really are in a state where the world is about to come crashing down around you, you can't raise a fatal error in lldb. And in this case, the world is only very minorly strange, so it is certainly not appropriate.

As for the overdetermined remark, it's what you suspect, that you are now asking two questions to get one answer, did the resume succeed. If the resume didn't succeed the state should be set back to stopped no matter why it didn't succeed. So you are fixing only one specific case only. Is it really not possible to either hold off on changing the process state till you've gotten back the result from DoResume, or correct the state after the fact? That seems the obviously correct fix here.

And I don't understand your answer to Adrian. In the case of core files, DoResume is clearly failing so if the error WERE handled correctly, why would there be any work needed?

lemo updated this revision to Diff 114890.Sep 12 2017, 1:32 PM
lemo added a reviewer: markmentovai.

Revised patch based on the review feedback: I looked into updating the running state after everything else succeeds, but it proved a bit awkward since 1) TrySetRunning() can fail and 2) if we check the running state then use SetRunning() instead of TrySetRunning() it opens the possibility for a race.

For reference, here are the considered options:

  1. Fix Process::Resume() so we only change the state late, after everything succeeds
  2. Rollback state change if anything fails while resuming
  3. Patch the specific case of "resume is not supported"
  4. Do nothing

The updated patch is using #2 - rollback the running state to "stopped" if the resume operation fails. The downside is the possibility of a "partial rollback" but from a cursory review of the code paths the risk seems no higher than option #1. Thoughts?

I'm looking into adding a test case as well but I'm uploading the WIP patch to get everyone a chance to take an early look.

Thanks!
Lemo.

lemo updated this revision to Diff 115068.Sep 13 2017, 10:26 AM
lemo marked an inline comment as done.

Adding test coverage

jingham accepted this revision.Sep 13 2017, 10:44 AM

This looks okay for now. It will end up sending a Running & then a Stopped event. That's a little awkward, but that happens in the ordinary course of debugging anyway so it shouldn't freak anybody out.

It would be better to find a place where this can be fixed before we send the running event, but that's going to take more thinking, and I don't see that desire should block this patch. But can you file another PR to go figure out how to fix this closer to the source, so we don't forget to fix this next time we're in the bowels of the process event code?

Thanks!

lemo added a comment.Sep 13 2017, 10:50 AM

@Jim: Good point, I created this bug to track this as you suggested:
Bug 34594 - Revisit the failure path for Process::Resume() to avoid leaking Stopped->Running->Stopped events

amccarth accepted this revision.Sep 13 2017, 10:52 AM

LGTM. Thanks for adding the tests.

@sas: Do the latest changes address your concerns with this patch?

sas accepted this revision.Sep 13 2017, 2:37 PM

Seems fine.

This revision is now accepted and ready to land.Sep 13 2017, 2:37 PM
This revision was automatically updated to reflect the committed changes.

FYI - this broke at least Windows, because the default behavior of Process::WillResume() changed from success to an error and WindowsProcess does not override WillResume(). Not sure what other platforms are in the same boat.

The default return value of WillResume should be no error. Sorry for not catching this. The core file Process subclasses will need to override this manually.

lemo updated this revision to Diff 115732.Sep 18 2017, 3:34 PM

Fixed the FreeBSD/Windows break : the intention was to keep Process::WillResume() and Process::DoResume() "in-sync", but this had the unfortunate consequence of breaking Process sub-classes which don't override WillResume().

The safer approach is to keep Process::WillResume() untouched and only override it in the minidump and core implementations.