This is an archive of the discontinued LLVM Phabricator instance.

ProcessMinidump: Suppress reporting stop for signal '0'
ClosedPublic

Authored by JosephTremoulet on Sep 26 2019, 12:09 PM.

Details

Summary

The minidump exception stream can report an exception record with
signal 0. If we try to create a stop reason with signal zero, processing
of the stop event won't find anything, and the debugger will hang.
So, simply early-out of RefreshStateAfterStop in this case.

Also set the UnixSignals object in DoLoadCore as is done for
ProcessElfCore.

Event Timeline

Herald added a project: Restricted Project. · View Herald TranscriptSep 26 2019, 12:09 PM

I'm trying to fix an issue where opening a minidump created by breakpad in lldb just hangs, but if I use breakpad's minidump-2-core on that same dump then opening the core dump works fine. From debugging the two cases I can see that the critical logic that ProcessElfCore has which ProcessMinidump lacks is here, and the code in this patch is my attempt to replicate that.

I'd love some help on how to create an appropriate test for this. From looking at existing tests, I gather it would make sense to add a test to lldb/lit/Minidump or packages/Python/lldbsuite/test/functionalities/postmortem/minidump-new (aside: what's the difference? is one preferred?) that has a yaml-ized minidump, the test would try to open the dump and make sure it didn't hang. I'm running into issues when I try to create that test, and I've put details in a gist. There's a program there which launches another program and then uses breakpad to capture a minidump of the child process (another aside: it looks like the existing tests all dump their own process, I think that dumping another process is the reason I'm running into this but others haven't). If I load that minidump without the fixes in this patch, lldb hangs. If I load that minidump with the fixes in this patch, lldb reads it and can show an accurate backtrace, I've put a log of that in the gist. But if I then run that minidump through obj-2-yaml and yaml-2-obj, if I load the resulting minidump, while it doesn't hang, it can't show an accurate backtrace. I've also put up a log of that.

I'd imagine my next step would be to look into what information is getting dropped when I try to do the round-trip through obj-2-yaml and yaml-2-obj, then update those utilities to preserve more, so that I can come back and add this test with a yaml input... does that sound right, or have I gone off track somewhere?

Thanks.

The functionality is fine, though I'd hope it can be cleaned up a bit.

As for testing, yes, yaml2obj has some problems roundtripping complex minidumps (and elf files). Fortunately, for the functionality you're testing, I don't think you really need the executable file, so you can just ignore the elf bit and test with a plain lldb -c foo.dmp (obviously, you won't get the backtrace that way, but you don't really need that here. Round-tripping minidumps is also WIP -- I've only implemented the bits I have needed so far (I am adding MemoryInfoList stream as we speak). For some streams that just means, they're not getting pretty-printed, while for others (those that contain pointers to other parts of the file) it means the reconstituted file ends up containing garbage. The exception stream is one of those cases, but here it looks like you actually want to test the case where the exception stream is not present (?), so that might be fine. Or it might be fine if you manually remove the ExceptionStream from the yaml? If you can send me the minidump you have generated I can try to play around with it to see if I can make a reasonable yaml out of it.

As for the "lldb/lit/Minidump or packages/Python/lldbsuite/test/functionalities/postmortem/minidump-new" part, the main difference there is how the test is written. Which one is better depends on what you want to test and to some extent, also on personal preference. The python tests are better when you want to programatically drive lldb and have complex interactions with it. the lit tests are good for simply executing a bunch of known commands, and verifying static output. For this case, I think a lit test is definitely the easier route.

lldb/source/Plugins/Process/minidump/ProcessMinidump.cpp
218–236

Would it be possible to move this code (except maybe the SetUnixSignals bit) into the RefreshStateAfterStop function? Would be less confusing and it would avoid the need for extra member variables...

JosephTremoulet marked an inline comment as done.Sep 27 2019, 8:42 AM

Fortunately, for the functionality you're testing, I don't think you really need the executable file, so you can just ignore the elf bit and test with a plain lldb -c foo.dmp (obviously, you won't get the backtrace that way, but you don't really need that here.

Ah, yeah, good point.

here it looks like you actually want to test the case where the exception stream is not present (?)

If I'm reading things correctly, the exception stream is there and we get a non-null exception record, which has a tid, but its code and flags are both zero.
This makes me realize that the title I've put here is very misleading, I'll fix that when I update the patch.

If you can send me the minidump you have generated I can try to play around with it to see if I can make a reasonable yaml out of it.

Thanks! Will send.

lldb/source/Plugins/Process/minidump/ProcessMinidump.cpp
218–236

Yeah, happy to. I wasn't sure what the implications of doing so would be, but I guess for a dump file there will only ever be one stop?

  • Move artificial SIGSTOP injection to RefreshStateAfterStop
JosephTremoulet retitled this revision from Add Linux signal support to ProcessMinidump to ProcessMinidump: inject SIGSTOP on Linux if no thread has a signal.Sep 27 2019, 11:57 AM
JosephTremoulet marked 2 inline comments as done.Sep 27 2019, 12:00 PM
JosephTremoulet added inline comments.
lldb/source/Plugins/Process/minidump/ProcessMinidump.cpp
218–236

I've moved it. In doing so I left the early-out for !m_active_exception, which is a behavioral difference from my previous iteration, but this covers my use-case so I think better to keep the change focused like this.

  • Use accessor for m_unix_signals_sp

Thanks for cleaning this up, and for sending the minidump. Looking at the dump, I am pretty sure that the problem is the lack of proper exception stream support in yaml2obj. The stream contains a reference to a thread context, and since yaml2obj does not understand this, it copies the offsets verbatim, and so they end up pointing to random garbage after the yaml round trip.

It doesn't look like it should be too hard to add yaml support for the exceptions stream -- it should only be a matter of adapting the patterns already used for other stream to work for this particular case. Could you give a go at that? Alternatively, you can wait a bit until I finish with the MemoryInfoList stream. After that, I can squeeze some time to do the Exception stream too.

lldb/source/Plugins/Process/minidump/ProcessMinidump.cpp
219

I think you can create this unconditionally.

257–264

After some investigation, I re-learned about the DumpRequested constant, which is already checked for in this code (and it is tested (TestMiniDumpNew.py:test_snapshow_minidump) and works/does not hang). It seems to me like it would make sense to treat this case the same way as DumpRequested, and just don't return any stop info (instead of returning a stop info with signal 0 or SIGSTOP). IOW, you would just put if (signo==0) return; here. Eventually we could apply the same change to process elf core as well..

WDYT?

  • Review feedback
JosephTremoulet retitled this revision from ProcessMinidump: inject SIGSTOP on Linux if no thread has a signal to ProcessMinidump: Suppress reporting stop for signal '0'.Sep 30 2019, 8:55 AM
JosephTremoulet edited the summary of this revision. (Show Details)
JosephTremoulet marked 2 inline comments as done.Sep 30 2019, 9:00 AM

It doesn't look like it should be too hard to add yaml support for the exceptions stream -- it should only be a matter of adapting the patterns already used for other stream to work for this particular case. Could you give a go at that?

Yes, will do, thanks for the pointers.

lldb/source/Plugins/Process/minidump/ProcessMinidump.cpp
257–264

SGTM, and seems to work for my use case. I wasn't sure though whether to leave the if (signo==0) return under if (arch.GetTripe().getOS() == llvm::Triple::Linux). Originally I'd put logic here just because I was copying logic from ProcessElfCore. It looks like the Apple case here would maybe similarly have an issue with exception code zero (?), but the else case (Windows?) doesn't reference the exception_code at all so I'd hesitate to apply this change there... I've left it under the check for Linux, but happy to move it if that seems wrong.

labath marked an inline comment as done.Sep 30 2019, 11:12 PM
labath added inline comments.
lldb/source/Plugins/Process/minidump/ProcessMinidump.cpp
257–264

Yeah, the exception code seems to be used completely differently on other OSs, so making this linux specific seems like the right choice.

  • Rebase
  • Add testcase
clayborg requested changes to this revision.Oct 10 2019, 2:06 PM

We need to grab the address from the exception info and pass that along for linux.

lldb/source/Plugins/Process/minidump/ProcessMinidump.cpp
264

Are exceptions on linux always just a signal number? Mac can be a mach exception or a signal. Might be worth looking at the breakpad sources to see if they ever fill in anything else for linux other than a signal.

Also, there is more info in the exception record that needs to be passed along. Like for a SIGSEGV, the address should be in the exception info somewhere.

This revision now requires changes to proceed.Oct 10 2019, 2:06 PM

We should add a test that has no exception info and verify we get a stopped process if we don't already have a test like this. LLDB doesn't like it when there is no stop reason and might try to resume the process and might not print out the stop and backtrace?

What lldb does not like the most is a stop info with a signal "0" -- that causes a resume. Having no stop info whatsoever seems to work just fine -- and this is already what happens when there is no exception record around. I was surprised by that distinction, but we already have tests with minidumps like those, and they seem to work ok, so I thought it'd make sense to do the same for the case where we have an exception record, but it contains no signal (as opposed to e.g. inventing a SIGSTOP out of thin air). We already have a test for that in this patch.

lldb/source/Plugins/Process/minidump/ProcessMinidump.cpp
264

Having the address (if it is there) would be nice, but I think that's a separate bug/feature, as this patch is about what happens when there is no exception record around.

Just to make sure I'm understanding the feedback correctly, I'll try to summarize. Please let me know if this is off track:

  • When one uses breakpad to generate a minidump for a process that *hasn't* crashed, depending which of its methods one calls to write the dump, the generated dump's exception stream's exception code is either the "dump requested" sentinel or zero.
  • We have a bug, which this patch is meant to fix, wherein loading one of those dumps with exception code zero creates a stop with signal zero, resulting in lldb hanging.
  • We have some "prior art" in ProcessElfCore, which is how we successfully load the core dump you get by running a zero-exception-code minidump through breakpad's minidump-2-core, which is to artificially report SIGSTOP. @labath, your feedback is that this is a hack that we shouldn't extend and should ideally/eventually remove from ProcessElfCore
  • We also have some "prior art" in ProcessMinidump, which is how we successfully load a minidump with the "dump requested" sentinel, which is to create no stop at all. That's what this patch currently does when it sees signal zero. @clayborg, your feedback is that simply checking that the error code is zero is insufficient grounds to suppress stop creation, that we should probably check other fields in the exception stream and create a stop if *any* of them are reporting anything non-null, and/or verify that the only way breakpad/crashpad will produce a linux minidump with null signal is if there's no exception whatsoever

I may need to set this aside for a while and come back to it to address @clayborg's concerns (if I've understood them correctly), but I can live with my downstream changes in the meantime, so that's seeming like the best path forward.

clayborg accepted this revision.Oct 11 2019, 10:59 AM

Just to make sure I'm understanding the feedback correctly, I'll try to summarize. Please let me know if this is off track:

  • When one uses breakpad to generate a minidump for a process that *hasn't* crashed, depending which of its methods one calls to write the dump, the generated dump's exception stream's exception code is either the "dump requested" sentinel or zero.

I believe so.

  • We have a bug, which this patch is meant to fix, wherein loading one of those dumps with exception code zero creates a stop with signal zero, resulting in lldb hanging.

If this patch is solely meant to fix this issue, then I remove my "Requires Changes" and this patch is good to go if it no longer hangs LLDB.

  • We have some "prior art" in ProcessElfCore, which is how we successfully load the core dump you get by running a zero-exception-code minidump through breakpad's minidump-2-core, which is to artificially report SIGSTOP. @labath, your feedback is that this is a hack that we shouldn't extend and should ideally/eventually remove from ProcessElfCore

I agree on this as long as no hang happens!

  • We also have some "prior art" in ProcessMinidump, which is how we successfully load a minidump with the "dump requested" sentinel, which is to create no stop at all. That's what this patch currently does when it sees signal zero. @clayborg, your feedback is that simply checking that the error code is zero is insufficient grounds to suppress stop creation, that we should probably check other fields in the exception stream and create a stop if *any* of them are reporting anything non-null, and/or verify that the only way breakpad/crashpad will produce a linux minidump with null signal is if there's no exception whatsoever

As Pavel mentioned, I was asking for other fields in the exception info to be passed along for other signals, like SIGSEGV. After reading his comments, I agree this can be done in another patch if it isn't already being done.

I may need to set this aside for a while and come back to it to address @clayborg's concerns (if I've understood them correctly), but I can live with my downstream changes in the meantime, so that's seeming like the best path forward.

If this patch fixes the hang, then it is good to go.

This revision is now accepted and ready to land.Oct 11 2019, 10:59 AM

If this patch is solely meant to fix this issue, then I remove my "Requires Changes" and this patch is good to go if it no longer hangs LLDB.

Great, thanks. And just to confirm on the testing:

  • test/functionalities/postmortem/minidump-new/linux-x86_64_not_crashed.cpp is an example program that uses breakpad to generate a dump that uses the sentinel signal, and we have a test checked in that uses a reduced yaml-ized copy of its minidump
  • https://gist.github.com/JosephTremoulet/2689c5ed21d7be5543cfcb70d548e0bc is an example program that uses breakpad to generate a dump that uses null signal, and in this patch I'm adding a test that uses a reduced yaml-ized copy of its minidump, which lldb hangs on loading without this fix and successfully loads with this fix
labath accepted this revision.Oct 14 2019, 12:38 AM
This revision was automatically updated to reflect the committed changes.