Page MenuHomePhabricator

Patch for lldb bug 26322 “core load hangs”
ClosedPublic

Authored by hhellyer on Nov 15 2016, 8:38 AM.

Details

Summary

This patch changes the way ProcessElfCore.cpp handles signal information.
The patch changes ProcessElfCore.cpp to use the signal from si_signo in SIGINFO notes in preference to the value of cursig in PRSTATUS notes. The value from SIGINFO seems to be more thread specific. The value from PRSTATUS is usually the same for all threads even if only one thread received a signal.
If it cannot find any SIGINFO blocks it reverts to the old behaviour and uses the value from cursig in PRSTATUS. If after that no thread appears to have been stopped it forces the status of the first thread to be SIGSTOP to prevent lldb hanging waiting for any thread from the core file to change state.

The order is:

  • If one or more threads have a non-zero si_signo in SIGINFO that will be used.
  • If no threads had a SIGINFO block with a non-zero si_signo set all threads signals to the value in cursig in their PRSTATUS notes.
  • If no thread has a signal set to a non-zero value set the signal for only the first thread to SIGSTOP.

This resolves two issues. The first was identified in bug 26322, the second became apparent while investigating this problem and looking at the signal values reported for each thread via “thread list”.

Firstly lldb is able to load core dumps generated by gcore where each thread has a SIGINFO note containing a signal number but cursig in the PRSTATUS block for each thread is 0.

Secondly if a SIGINFO note was found the “thread list” command will no longer show the same signal number for all threads. At the moment if a process crashes, for example with SIGILL, all threads will show “stop reason = signal SIGILL”. With this patch only the thread that executed the illegal instruction shows that stop reason. The other threads show “stop reason = signal 0”.

Event Timeline

hhellyer updated this revision to Diff 78010.Nov 15 2016, 8:38 AM
hhellyer retitled this revision from to Patch for lldb bug 26322 “core load hangs”.
hhellyer updated this object.
hhellyer added a subscriber: lldb-commits.
clayborg requested changes to this revision.Nov 15 2016, 9:05 AM
clayborg edited edge metadata.

Just a few quick changes.

source/Plugins/Process/elf-core/ProcessElfCore.cpp
208–209

If you are just iterating (not also wanting indexes, or keeping any iterators in the for loop) then use:

for (const auto &thread_data: m_thread_data) {
  if (thread_data.signo != 0)
    siginfo_signal_found = true;
  if (thread_data.prstatus_sig != 0)
    prstatus_signal_found = true;
}
219–220
for (const auto &thread_data: m_thread_data) {
source/Plugins/Process/elf-core/ThreadElfCore.h
93

add const to "arch":

const lldb_private::ArchSpec &arch
100

Add const:

const lldb_private::ArchSpec &arch
This revision now requires changes to proceed.Nov 15 2016, 9:05 AM
jingham edited edge metadata.Nov 15 2016, 9:10 AM

Besides Greg's comments, this looks reasonable to me.

labath added a subscriber: labath.Nov 15 2016, 9:14 AM

Thank you for looking into this. This has been a long standing issue that we haven't got time to address.

Could you also add some tests to cover the new functionality? It sounds like it would be easy to generate tiny core files which trigger this. You can look at tests in TestLinuxCore.py for examples. If you run into trouble there, let me know.

sas added a subscriber: sas.Nov 15 2016, 10:53 AM

Just a couple nits inline. Also, did you run clang-format on your change? I see some issues with ifs and the associated parentheses.

source/Plugins/Process/elf-core/ProcessElfCore.cpp
215

Nit: if (!siginfo_signal_found).

218

Similar nit: if (prstatus_signal_found).

labath added inline comments.Nov 16 2016, 2:23 AM
source/Plugins/Process/elf-core/ProcessElfCore.cpp
224

You should not use signal numbers from the host here -- the macro may not be present (windows), or it may not have the same value. You need to get the correct set of signals with Process::GetUnixSignals(), and then look up the signal by name.

hhellyer updated this revision to Diff 78163.Nov 16 2016, 3:49 AM
hhellyer edited edge metadata.

Update with fixes from review comments.
I've run clang-format, fixed the use of iterators and added const qualifiers.
I removed the use of the SIGSTOP constant and use GetUnixSignals instead.
I've also moved that block down a few lines until after SetUnixSignals has been called.

I will update the patch with the test cases shortly.

hhellyer updated this revision to Diff 78173.Nov 16 2016, 5:14 AM
hhellyer edited edge metadata.

Update patch due to two missing changes.

clayborg accepted this revision.Nov 16 2016, 11:12 AM
clayborg edited edge metadata.

Looks good.

This revision is now accepted and ready to land.Nov 16 2016, 11:12 AM

This is marked ready to land and I can land it now but I'm still working on the testcases.
I can either:

  • Land the code changes and do the tests under another patch.
  • Hold off on delivering this until I've finished the tests.

I'm not sure which is the preferred option. The tests will probably need to be delivered with core files so they will be much larger deltas than just the functional code changes currently contained in this patch. Is there a preferred option?

I can't say we're being very consistent in enforcing it, but general llvm policy is for tests to go together with the changes.

I am curious, how do you go about creating these core files?

The core files I created were very tiny as I did not require any syscalls, so I could avoid linking libc. Here it sounds like you would need at least one (clone(2)). I am considering reimplementing the required clone(2) bits inline, but I am afraid that may end up being a bit hairy.

I haven't solved that yet! ;-)

I'm currently ending up with cores of ~500kb which is probably too big. I'm seeing what I can do to bring them down but it might be that I can't shrink them that much. I'm attempting to write two tests, one for multiple threads and one for cores generated by gcore but the actual C program for both is very similar.

What would you say a sensible target size actually is?

The other option is to run these tests just on Linux then it might be possible to create the core dumps as part of the test. There's a lot of drawbacks to that - not least just that they wouldn't be run every where but also that depending on how the OS is setup it's not always possible to create a core on a signal or attach gcore. I don't really want to add flakey tests.

I haven't solved that yet! ;-)

I'm currently ending up with cores of ~500kb which is probably too big. I'm seeing what I can do to bring them down but it might be that I can't shrink them that much. I'm attempting to write two tests, one for multiple threads and one for cores generated by gcore but the actual C program for both is very similar.

What would you say a sensible target size actually is?

I would say about 100k is OK, but of course, the less, the better.

The other option is to run these tests just on Linux then it might be possible to create the core dumps as part of the test. There's a lot of drawbacks to that - not least just that they wouldn't be run every where but also that depending on how the OS is setup it's not always possible to create a core on a signal or attach gcore. I don't really want to add flakey tests.

Yeah, we can't rely on linux kernel to generate the core files, as there are many ways that could be disabled. If we had a core file writer in LLDB (I don't know if we do, or if it works), we could possibly do that.

That said, I was able to trivially generate a ~20k core file by setting /proc/$PID/coredump_filter to 0. You won't get any memory regions that way, but it does not sound like you need any for this test anyway. It could probably be even improved by some tricks like static linking (to avoid having a long list of modules, etc). Have you tried that?

I'd assumed that gcore would ignore the coredump_filter setting but it turns out it doesn't so I should be able to use that. lldb seems happy enough opening the core files produced with a filter of 0 so it's probably the simplest solution.

I don't think lldb has a core file writer for Linux/ELF, process save-core doesn't seem to think it does. I'm not sure it would help in this case though - I need to make sure we can open core files created via gcore. Opening core files generated by lldb itself would be a separate test case. It's still quite a desirable feature, Linux does not really have a good API for creating core dumps. gcore is useful but it still means exec-ing another process.

I'll tidy up my test cases and try to generate at least 64 and 32 bit intel cores to go with them before adding them to this patch.

Checking core files into the repository will grow all git based repositories over time as every version of every core file will be included in the git clone. We have avoided checking in core files or any other larger files for this very reason. Not sure what the right answer is here.

hhellyer updated this revision to Diff 78524.Nov 18 2016, 7:23 AM
hhellyer edited edge metadata.

Updating the patch to include the test cases and scripts to produce the core dumps.
I haven't included the core dumps and when I do I'll need to update the pid and
tid values in the test scripts. That can wait until we've decided what to do about
the core dumps though.

We probably don't want to check in zillions of core files, but I believe having a couple of them is fine, particularly when care is taken to minimize their size. I believe the benefits (being able to run tests on a piece of code in a reproducible and platform-independent manner) outweigh the downsides. And I don't expect these files to change every week (probably more like never), so they won't bloat the history. Compare that with a 7MB python reference html file we currently store in the repo, which we should update (but we don't) on every API change.

hhellyer updated this revision to Diff 79051.Nov 23 2016, 3:36 AM

Updated to add the test executables and core files.

@labath If you are happy with the tests and the size of their supporting files I'll land this, otherwise let me know if anything needs changing.

I just tried it, and these tests run fine without the executable file. Just remove those, and I think we're ready.

packages/Python/lldbsuite/test/functionalities/postmortem/elf-core/gcore/TestGCore.py
39

You don't need the executable file for this test. Just create an empty target self.dbg.CreateTarget("") and use that.

hhellyer updated this revision to Diff 79056.Nov 23 2016, 5:59 AM

Updating patch to remove executables and create empty targets in the test cases.

I'll land these changes tomorrow morning (UK time) so I can watch the builds and check there's no problems with the tests.

hhellyer closed this revision.Nov 24 2016, 1:06 AM

It fails on the linux buildbot, but not when I run it locally. I am going to investigate the issue.

Oh, you fixed it. Thank you. :)

@labath - Yes sorry about that. To be safe I did the "honest" thing of using arc to pull down the patch from phabriactor and then arc commit to land it to make sure I committed exactly what was reviewed. Somewhere in that it lost the contents of the core files and just created empty files for them. :-(