This is an archive of the discontinued LLVM Phabricator instance.

Fix shared library loading when users define duplicate _r_debug structure.
ClosedPublic

Authored by clayborg on Aug 22 2023, 11:43 PM.

Details

Summary

We ran into a case where shared libraries would fail to load in some processes on linux. The issue turned out to be if the main executable or a shared library defined a symbol named "_r_debug", then it would cause problems once the executable that contained it was loaded into the process. The "_r_debug" structure is currently found by looking through the .dynamic section in the main executable and finding the DT_DEBUG entry which points to this structure. The dynamic loader will update this structure as shared libraries are loaded and LLDB watches the contents of this structure as the dyld breakpoint is hit. Currently we expect the "state" in this structure to change as things happen. An issue comes up if someone defines another "_r_debug" struct in their program:

r_debug _r_debug;

If this code is included, a new "_r_debug" structure is created and it causes problems once the executable is loaded. This is because of the way symbol lookups happen in linux: they use the shared library list in the order it created and the dynamic loader is always last. So at some point the dynamic loader will start updating this other copy of "_r_debug", yet LLDB is only watching the copy inside of the dynamic loader.

Steps that show the problem are:

  • lldb finds the "_r_debug" structure via the DT_DEBUG entry in the .dynamic section and this points to the "_r_debug" in ld.so
  • ld.so modifies its copy of "_r_debug" with "state = eAdd" before it loads the shared libraries and calls the dyld function that LLDB has set a breakpoint on and we find this state and do nothing (we are waiting for a state of eConsistent to tell us the shared libraries have been fully loaded)
  • ld.so loads the main executable and any dependent shared libraries and wants to update the "_r_debug" structure, but it now finds "_r_debug" in the a.out program and updates the state in this other copy
  • lldb hits the notification breakpoint and checks the ld.so copy of "_r_debug" which still has a state of "eAdd". LLDB wants the new "eConsistent" state which will trigger the shared libraries to load, but it gets stale data and doesn't do anyhing and library load is missed. The "_r_debug" in a.out has the state set correctly, but we don't know which "_r_debug" is the right one.

The new fix detects the two "eAdd" states and loads shared libraries and will emit a log message in the "log enable lldb dyld" log channel which states there might be multiple "_r_debug" structs.

The correct solution is that no one should be adding a duplicate "_r_debug" symbol to their binaries, but we have programs that are doing this already and since it can be done, we should be able to work with this and keep debug sessions working as expected. If a user #includes the <link.h> file, they can just use the existing "_r_debug" structure as it is defined in this header file as "extern struct r_debug _r_debug;" and no local copies need to be made.

If your ld.so has debug info, you can easily see the duplicate "_r_debug" structs by doing:

(lldb) target variable _r_debug --raw
(r_debug) _r_debug = {
  r_version = 1
  r_map = 0x00007ffff7e30210
  r_brk = 140737349972416
  r_state = RT_CONSISTENT
  r_ldbase = 0
}
(r_debug) _r_debug = {
  r_version = 1
  r_map = 0x00007ffff7e30210
  r_brk = 140737349972416
  r_state = RT_ADD
  r_ldbase = 140737349943296
}
(lldb) target variable &_r_debug
(r_debug *) &_r_debug = 0x0000555555601040
(r_debug *) &_r_debug = 0x00007ffff7e301e0

And if you do a "image lookup --address <addr>" in the addresses, you can see one is in the a.out and one in the ld.so.

Adding more logging to print out the m_previous and m_current Rendezvous structures to make things more clear. Also added a log when we detect multiple eAdd states in a row to detect this problem in logs.

Diff Detail

Event Timeline

clayborg created this revision.Aug 22 2023, 11:43 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 22 2023, 11:43 PM
clayborg requested review of this revision.Aug 22 2023, 11:43 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 22 2023, 11:43 PM

I'm not familiar with this mechanism but just out of curiosity: ld.so loads the main executable and any dependent shared libraries and wants to update the "_r_debug" structure, but it now finds "_r_debug" in the a.out program and updates the state in this other copy

Is this some undefined behaviour or is there a good use case for this? From the program's point of view.

lldb/source/Plugins/DynamicLoader/POSIX-DYLD/DYLDRendezvous.cpp
260–266

Can we split this sentence just for readability? Bullet points of each step might be clearer.

260–266

What you have in the commit message basically.

lldb/test/API/functionalities/dyld-multiple-rdebug/TestDyldWithMultupleRDebug.py
23

Any specific reason for this? Not that it really matters, it'll get plenty of testing elsewhere.

29

Leftover debug print.

aprantl added inline comments.Aug 23 2023, 9:06 AM
lldb/test/API/functionalities/dyld-multiple-rdebug/TestDyldWithMultupleRDebug.py
55

You should be able to shorten the test setup considerably by using lldbutil.run_to_name_breakpoint() and only manually setting the second breakpoint afterwards.

I'm not familiar with this mechanism but just out of curiosity: ld.so loads the main executable and any dependent shared libraries and wants to update the "_r_debug" structure, but it now finds "_r_debug" in the a.out program and updates the state in this other copy

Is this some undefined behaviour or is there a good use case for this? From the program's point of view.

This is just how name lookups happen across shared library boundaries as far as I know. External symbols get resolved by searching the shared libraries including the main executable.

I have verified this is indeed what is happening by debugging the ld.so binary and stepping through it at this code in rtld.c:

/* Notify the debugger all new objects are now ready to go.  We must re-get
   the address since by now the variable might be in another object.  */
r = _dl_debug_initialize (0, LM_ID_BASE);
r->r_state = RT_CONSISTENT;
_dl_debug_state ();

The code for _dl_debug_initialize() looks like:

struct r_debug *
_dl_debug_initialize (ElfW(Addr) ldbase, Lmid_t ns)
{
  struct r_debug *r;

  if (ns == LM_ID_BASE)
    r = &_r_debug; /// <- this will switch to 
  else
    r = &GL(dl_ns)[ns]._ns_debug;

  if (r->r_map == NULL || ldbase != 0)
    {
      /* Tell the debugger where to find the map of loaded objects.  */
      r->r_version = 1	/* R_DEBUG_VERSION XXX */;
      r->r_ldbase = ldbase ?: _r_debug.r_ldbase;
      r->r_map = (void *) GL(dl_ns)[ns]._ns_loaded;
      r->r_brk = (ElfW(Addr)) &_dl_debug_state;
    }

  return r;
}

And the line "r = &_r_debug;" picks up the first item in the library search paths for "_r_debug" which gets the one from the a.out program...

lldb/source/Plugins/DynamicLoader/POSIX-DYLD/DYLDRendezvous.cpp
260–266

yeah, I was tired last night when finishing this up, I will grab the stuff from the commit message

lldb/test/API/functionalities/dyld-multiple-rdebug/TestDyldWithMultupleRDebug.py
23

Yeah, copy paste issue. I would be able to remove this.

29

good catch, yes!

55

will do

clayborg updated this revision to Diff 552966.Aug 23 2023, 7:59 PM

Address review comments:

  • Added documentation in DYLDRendezvous.h to explain how things are supposed to work
  • Updated the documentation in DYLDRendezvous::GetAction() to explain why we need to work around seeing two consecutive eAdd states
  • Greatly simplify the test that was added per suggestions

Test LGTM, I'll defer to others for the dynamic loader plugin.

clayborg updated this revision to Diff 553259.Aug 24 2023, 2:12 PM

Fixed more header documentation and cleaned up the test even more.

Anyone have any objections? Super easy to repro this bug with the test program.

The added context helps document what was already there so that's a nice improvement.

Something I'm not clear on mechanically. The original r_debug has eAdd set. Then the second r_debug is loaded, which also has eAdd set. What is the state of r_map at that point? I wonder if it could be somehow different between the 2 copies, with each having a subset of the list.

Also, why do we not have to wait for eConsistent on the second r_debug? Is it that we do in fact wait for that, but we load the library list from the first r_debug, then when the second one gets eConsistent, we load the rest from that.

The added context helps document what was already there so that's a nice improvement.

Something I'm not clear on mechanically. The original r_debug has eAdd set. Then the second r_debug is loaded, which also has eAdd set. What is the state of r_map at that point? I wonder if it could be somehow different between the 2 copies, with each having a subset of the list.

Once we stop at the breakpoint the second time, second r_debug has eConsistent set when the original has eAdd, but we only look at the original r_debug in the ld.so binary.

Actually the _only_ thing the "_r_debug" in the main executable has set is the "r_state" when we stop at the second breakpoint, the r_map is set correctly in the ld.so version, but not in the a.out version. If you continue to run, the a.out version does eventually get updated. It just depends when the ld.so writes new things into the new r_debug struct.

I actually ran a test where I created a global variable with the same name but different type:

char _r_debug[40] = "012345678901234567890123456789012345678"; // Same size as "r_debug" for safety

And then ran the program to the entry point and the "r_state" bytes in the above character array had been written to zero, but nothing else was touched. So I know this definitely isn't meant to happen.

The ld.so's "_r_debug.r_map" is correct and has a linked list of all the current libraries. But it really depends on when the _r_debug struct is written to again by ld.so.

Also, why do we not have to wait for eConsistent on the second r_debug?

Because the first eAdd state indicates the ld.so is _about_ to load the shared libraries. So debuggers can't actually set breakpoints in them until all of the shared libraries are actually loaded.

Is it that we do in fact wait for that, but we load the library list from the first r_debug, then when the second one gets eConsistent, we load the rest from that.

It is because when we receive the eConsistent the libraries are actually now loaded into memory and then we can load the libraries in LLDB and can actually resolve any breakpoints since the .text sections are now mapped and we can write breakpoints into their .text sections.

DavidSpickett accepted this revision.Aug 31 2023, 1:36 AM

I understand the mechanism now, so for lack of any other strong opinions, LGTM.

This revision is now accepted and ready to land.Aug 31 2023, 1:36 AM