This is an archive of the discontinued LLVM Phabricator instance.

Improve dynamic loader support in DynamicLoaderPOSIXDYLD when using core files.
ClosedPublic

Authored by clayborg on Sep 28 2022, 5:00 PM.

Details

Summary

Prior to this fix, no shared libraries would be loaded for a core file, even if they exist on the current machine. The issue was the DYLDRendezvous would read a DYLDRendezvous::Rendezvous from memory of the process in DYLDRendezvous::Resolve() which would read some ld.so structures as they existed in the middle of a process' lifetime. In core files we see, the DYLDRendezvous::Rendezvous::state would be set to eAdd for running processes. When ProcessELFCore.cpp would load the core file, it would call DynamicLoaderPOSIXDYLD::DidAttach(), which would call the above Rendezvous functions. The issue came when during the DidAttach function it call DYLDRendezvous::GetAction() which would return eNoAction if the DYLDRendezvous::m_current.state was read from memory as eAdd. This caused no shared libraries to be loaded for any ELF core files. We now detect if we have a core file and after reading the DYLDRendezvous::m_current.state from memory we set it to eConsistent, which causes DYLDRendezvous::GetAction() to return the correct action of eTakeSnapshot and shared libraries get loaded.

We also improve the DynamicLoaderPOSIXDYLD class to not try and set any breakpoints to catch shared library loads/unloads when we have a core file, which saves a bit of time.

Diff Detail

Event Timeline

clayborg created this revision.Sep 28 2022, 5:00 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 28 2022, 5:00 PM
clayborg requested review of this revision.Sep 28 2022, 5:00 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 28 2022, 5:00 PM

All current post mortem tests pass with this fix. Not sure how to test this stuff since if a core files have hard coded paths in the rendezvous structures in memory and it isn't possible to make a small core file that contains the right paths.

yinghuitan accepted this revision.Sep 28 2022, 9:54 PM

I think @labath or anyone owns the ELF coredump debugging should take a look. I am surprised other major companies did not hit this issue.

Otherwise, LGTM

This revision is now accepted and ready to land.Sep 28 2022, 9:54 PM

I am surprised other major companies did not hit this issue.

That could be because this is something specific to your environment. Just to be clear, is this happening for *all* core files or only for some of them? If only some, is there anything special about the state of the applications that produced those core files (e.g. are they in the middle of loading a shared library?)
Even though this may very well be the right fix for middle-of-dlopen core dumps (we can't really wait for the loading to finish), I suspect this is actually masking some other problem, as the amount of time an application spends in the RT_ADD state is very brief, and it shouldn't be doing anything crash-prone while in there.

lldb/source/Plugins/DynamicLoader/POSIX-DYLD/DYLDRendezvous.cpp
174–181

How about we just change GetAction to directly return eTakeSnapshot in this case?

clayborg added inline comments.Sep 29 2022, 10:18 AM
lldb/source/Plugins/DynamicLoader/POSIX-DYLD/DYLDRendezvous.cpp
174–181

that is a good idea

I am surprised other major companies did not hit this issue.

That could be because this is something specific to your environment. Just to be clear, is this happening for *all* core files or only for some of them? If only some, is there anything special about the state of the applications that produced those core files (e.g. are they in the middle of loading a shared library?)

These are produced by the linux kernel. If we produce any core files manually we make Minidumps as they contain more useful information. ELF core files are kind of a joke as they don't include any build IDs for the shared libraries so many people often will try to load a core file on another system and they end up just loading the shared libraries from the current machine (like /usr/lib/libc.so etc) and they will never know. ELF core file are only really useful on the same machine or for looking at the main binary's stack frames.

Even though this may very well be the right fix for middle-of-dlopen core dumps (we can't really wait for the loading to finish), I suspect this is actually masking some other problem, as the amount of time an application spends in the RT_ADD state is very brief, and it shouldn't be doing anything crash-prone while in there.

Thanks for the background on the eAdd state, makes more sense, there are no docs on what these states mean so I didn't know what the normal values were. I will check our small core files that are checked in and see if they end up in these states as well and report back. I will also check out the core file that has this issue to see if one of the threads is doing dlopen.

clayborg updated this revision to Diff 463962.Sep 29 2022, 10:51 AM

Switch DYLDRendezvous::GetAction() to return eTakeSnapshot when we have a core file and don't modify m_current.state.

clayborg updated this revision to Diff 464097.Sep 29 2022, 4:50 PM

Fix typo in comment.

labath accepted this revision.Oct 3 2022, 7:01 AM
aadsm added a subscriber: aadsm.Oct 3 2022, 11:31 AM
This comment was removed by aadsm.