This is an archive of the discontinued LLVM Phabricator instance.

Strip ELF symbol versions from symbol names
Abandoned, Public

Authored by labath on Feb 25 2015, 10:02 AM.

Details

Summary

The following issue occurs when loading glibc with debug symbols:

  • since the file now has both .symtab and .dynsym sections, LLDB decides to load symbols from .symtab, assuming that this is a superset of .dynsym.
  • this is not entirely true when ELF symbol versions come into play
  • For example, .symtab on glibc contains symbols 'memcpy@@GLIBC_2.14' and 'memcpy@GLIBC_2.2.5', but it does not contain the symbol 'memcpy'
  • when we attempt to evaluate an expression referencing 'memcpy' we fail, as we cannot resolve the symbol.

This patch resolves the problem by stripping the version suffix from the default symbols (those containing '@@').
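As a rough illustration of the stripping described in the summary, here is a minimal sketch. It is not the code from the patch: the function name StripDefaultVersion is hypothetical, and the only behavior it assumes is what the summary states, namely that the suffix is removed from default-version symbols (those containing '@@') and nothing else.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper, not taken from the patch: returns the symbol name with
// a default-version suffix ("@@VERSION") removed. Names that carry a
// non-default version (a single '@') or no version are returned unchanged.
std::string StripDefaultVersion(std::string_view name) {
  const std::size_t pos = name.find("@@");
  if (pos == std::string_view::npos)
    return std::string(name);
  return std::string(name.substr(0, pos));
}
```

With this, "memcpy@@GLIBC_2.14" becomes "memcpy", while "memcpy@GLIBC_2.2.5" keeps its suffix and so never shadows the default version under the bare name.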

Event Timeline

labath updated this revision to Diff 20687. Feb 25 2015, 10:02 AM
labath retitled this revision from to Strip ELF symbol versions from symbol names.
labath updated this object.
labath edited the test plan for this revision.
labath added reviewers: clayborg, emaste, zturner.
labath added a subscriber: Unknown Object (MLST).

+majnemer.

I don't want to block this over a small change, but I really, really want to emphasize that I don't like seeing us performing string operations on mangled names. It's already a big problem all over the codebase, and it's only getting worse.

If it's going to continue becoming increasingly common that we need to understand properties of mangled names, then please consider creating an abstraction in LLVM that has a nice interface and can query properties of mangled names like elf_name.GetNameWithoutVersion() or something similar.

+majnemer, who knows quite a bit about different naming schemes, and has already talked about writing such a class in LLVM. So perhaps efforts to this effect could be coordinated.
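To make the suggested abstraction a little more concrete, here is a minimal sketch of what such a class could look like. Everything in it is an assumption for illustration: the class name ElfSymbolName and all members other than GetNameWithoutVersion() are made up, and nothing here is an existing LLVM or LLDB API.

```cpp
#include <string>
#include <string_view>

// Hypothetical class sketching the abstraction proposed above. It parses the
// "name@VERSION" / "name@@VERSION" forms once, so that callers query
// properties instead of doing their own string operations.
class ElfSymbolName {
public:
  explicit ElfSymbolName(std::string_view full_name) : m_full(full_name) {
    const std::size_t at = m_full.find('@');
    if (at == std::string::npos) {
      m_base_len = m_full.size(); // unversioned name
      return;
    }
    m_base_len = at;
    // "@@" marks the default version, a single "@" a non-default one.
    m_is_default = m_full.compare(at, 2, "@@") == 0;
    m_version_pos = at + (m_is_default ? 2 : 1);
  }

  std::string_view GetFullName() const { return m_full; }
  std::string_view GetNameWithoutVersion() const {
    return std::string_view(m_full).substr(0, m_base_len);
  }
  bool HasVersion() const { return m_version_pos != std::string::npos; }
  bool IsDefaultVersion() const { return m_is_default; }
  std::string_view GetVersion() const {
    if (!HasVersion())
      return std::string_view();
    return std::string_view(m_full).substr(m_version_pos);
  }

private:
  std::string m_full;
  std::size_t m_base_len = 0;
  std::size_t m_version_pos = std::string::npos;
  bool m_is_default = false;
};
```

The returned views point into the object's own copy of the name, so they stay valid only as long as the ElfSymbolName instance does.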

clayborg requested changes to this revision. Feb 25 2015, 10:48 AM
clayborg edited edge metadata.

You might say the mangled name is "memcpy@@GLIBC_2.14" and set the normal name to "memcpy", just in case anyone needs to find the original "mangled" symbol. We do this in MachO for things like "_OBJC_CLASS_$_TestObject", which we turn into a symbol whose mangled name is "_OBJC_CLASS_$_TestObject" and whose demangled name is "TestObject". So I would suggest always trying to maintain the original symbol name as the mangled version for symbols, just in case.

This revision now requires changes to proceed. Feb 25 2015, 10:48 AM

when we attempt to evaluate an expression referencing 'memcpy' we fail, as we cannot resolve the symbol.

At the risk of being too picky, we might also want to support calling the correct version of memcpy based on the version of glibc that you are using. It's not particularly important in this case (glibc 2.15 is my personal minspec). But there may be other instances.

The situation is not as simple as considering "memcpy@@GLIBC_2.14" as a mangled version and "memcpy" as unmangled. It is more like the "@@VERSION" suffix gets added to whatever name is produced by the compiler when you want to do ELF symbol versioning, which means the string before the "@@" could be mangled already. Looking through the debug symbols of libstdc++ I have found a couple of symbols like "_ZSt10adopt_lock@@GLIBCXX_3.4.11". Here it would be natural to say "_ZSt10adopt_lock" is the mangled version and "std::adopt_lock" the unmangled. The "@@GLIBCXX_3.4.11" is a version tag used by the ELF linker/loader, but is not a part of the symbol name as far as the compiler is concerned -- it is added by deep assembler/linker magic.
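A small sketch of the layering described above: the version tag is cut off first, and only the remaining compiler-produced name is handed to a demangler. The helper name DemangleBaseName is hypothetical, and the sketch uses abi::__cxa_demangle from the GCC/Clang C++ runtime rather than whatever demangling path LLDB itself takes, which is an assumption made purely for illustration.

```cpp
#include <cxxabi.h>

#include <cstdlib>
#include <string>

// Hypothetical helper: drops any "@VERSION" / "@@VERSION" suffix and demangles
// what is left. Names that are not Itanium-mangled (no "_Z" prefix) or that
// fail to demangle are returned without the suffix but otherwise unchanged.
std::string DemangleBaseName(const std::string &full_name) {
  // substr(0, npos) yields the whole string when there is no '@'.
  const std::string base = full_name.substr(0, full_name.find('@'));
  if (base.compare(0, 2, "_Z") != 0)
    return base;

  int status = 0;
  char *demangled =
      abi::__cxa_demangle(base.c_str(), nullptr, nullptr, &status);
  if (status != 0 || demangled == nullptr)
    return base;

  std::string result(demangled);
  std::free(demangled); // __cxa_demangle allocates with malloc
  return result;
}
```

Passing the full "_ZSt10adopt_lock@@GLIBCXX_3.4.11" string straight to the demangler would fail, since the suffix is not part of the mangling grammar.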

I agree that it would be nice to preserve the full symbol name, but I think this would require a new field in the Symbol class to do it properly. I can do that if you think it is the best.

At the risk of being too picky, we might also want to support calling the correct version of memcpy based on the version of glibc that you are using. It's not particularly important in this case (glibc 2.15 is my personal minspec). But there may be other instances.

With this patch and debug-glibc, "expr memcpy(...)" will call whatever is the latest version of memcpy in the library.
Without this patch and with debug-glibc, "expr memcpy(...)" will return an error.

Without debug symbols in glibc we actually end up with two definitions of memcpy in our symbol table (in the .dynsym table the symbol version is not stored as a @ suffix but in a separate section, which we ignore) and "expr" will call a random one, regardless of whether this patch is applied or not.

This could be a problem if a user program links against an old version of the symbol. The user would see "memcpy(...)" in the source code, but an attempt to debug it by issuing an "expr memcpy" command would end up taking a different code path. Detecting this will not be easy, since in theory different parts of a program can link to different symbol versions. Still, this shouldn't be much of a problem, since normally when you compile your program afresh, it will pick the latest version available.

I will hold off on submitting until we can find an acceptable solution.

PS: I know that GLIBC_2.14 sounds old, but this just means that the function has not changed significantly since 2.14. The version of glibc I am looking at is 2.19.

So here is me thinking out loud about this issue...

What are the current use cases for the Symbols and SymTabs in lldb?

  • symbolification (aka looking up a symbol by address): In this case we would probably want to output "memcpy@@GLIBC_2.14" because that _is_ the name of the symbol in the object file and it also provides the most information.
  • symbol resolution (aka looking up a symbol by name): in which situations do we need to do this? Currently, I am aware of only one: user provided expressions in the "expr" command. Are there any other use cases?
    • the ELF versioning spec says that when we do not have any additional information, we should pick the default (latest) version. This is the one with @@ in its name. When the user types "expr memcpy(a,b,c)", we do not have any information, so the string "memcpy" should resolve to the same address as "memcpy@@GLIBC_2.14". We could try to be clever and figure out what version is used in the rest of the code, but that may prove to be quite difficult. Furthermore, we almost certainly want the expression char foo[]="bar"; do_something_with(foo) (which compiles to something involving memcpy) to use the default symbol version, since the user is probably not even aware that there is a call to memcpy involved (I certainly wasn't).
    • we would like to keep the non-default symbol versions (e.g. "memcpy@GLIBC_2.2.5"), so that we can do symbolification, but we don't want "memcpy" to resolve to these symbols unless the user explicitly specifies "memcpy@GLIBC_2.2.5" (which right now he can't, as the expr command will bark out a syntax error). It might be possible to call the function by embedding the right asm commands in the expr expression, but I do not care about this right now.
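The resolution rules from the list above can be sketched as a tiny name index. This is only an illustration under stated assumptions: VersionedNameIndex is a made-up class, not LLDB's Symtab, and the addresses in the usage note are invented.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Hypothetical name index. Every symbol is reachable under its full name; a
// default-version symbol ("name@@VERSION") is additionally reachable under
// the bare name. Non-default versions ("name@VERSION") never claim the bare
// name, so they resolve only when spelled out explicitly.
class VersionedNameIndex {
public:
  void Add(const std::string &full_name, std::uint64_t address) {
    m_by_name[full_name] = address;
    const std::size_t pos = full_name.find("@@");
    if (pos != std::string::npos)
      m_by_name[full_name.substr(0, pos)] = address;
  }

  std::optional<std::uint64_t> Resolve(const std::string &name) const {
    const auto it = m_by_name.find(name);
    if (it == m_by_name.end())
      return std::nullopt;
    return it->second;
  }

private:
  std::map<std::string, std::uint64_t> m_by_name;
};
```

After adding "memcpy@@GLIBC_2.14" at 0x2000 and "memcpy@GLIBC_2.2.5" at 0x1000, Resolve("memcpy") yields 0x2000, and the older version is found only through its full versioned name.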

So how do we achieve this? For C symbols we can store the full symbol name in the mangled field and the bare name in the demangled one. However, this does not work for C++ symbols, as they already use both fields. Furthermore, demangling of versioned C++ symbols currently fails completely, as the demangler does not understand the version specifications. For the "symbolification" use case it would be best to have "_ZSt10adopt_lock@@GLIBCXX_3.4.11" as the mangled name and "std::adopt_lock@@GLIBCXX_3.4.11" as the demangled. However, for symbol resolution, we want both "_ZSt10adopt_lock" and "_ZSt10adopt_lock@@GLIBCXX_3.4.11" to resolve correctly. I can think of three ways to achieve this:

  • teach the Symbol class to do intelligent string matching, so that it can resolve both versioned and unversioned names. Not optimal, since it would complicate the general Symbol class due to an ELF peculiarity.
  • insert two Symbol instances into the Symtab. Symbol resolution would be easy, but if we want to guarantee that we always return the versioned symbol during symbolification, we would need to do something clever there, which is again not nice.
  • allow symbols to have multiple names - again not optimal since it complicates the Symbol class, but at least the version handling could be contained in the ELF specific code - the Symbol wouldn't know about the versions, it would only know it has these 2 (or whatever) names.
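To give a feel for the third option, here is a minimal sketch of a symbol that carries several names. All type and function names (MultiNameSymbol, MultiNameSymtab, MakeElfSymbol) are hypothetical and do not correspond to LLDB's Symbol or Symtab classes; address lookup is exact-match only, which a real symbol table would not settle for.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical symbol with multiple names. By convention names[0] is the full
// name as found in the object file; the generic code knows nothing about
// versions.
struct MultiNameSymbol {
  std::uint64_t address;
  std::vector<std::string> names;
};

// Hypothetical symbol table. Returned pointers are invalidated by later Add()
// calls, which is acceptable for a sketch only.
class MultiNameSymtab {
public:
  void Add(MultiNameSymbol symbol) {
    const std::size_t index = m_symbols.size();
    for (const std::string &name : symbol.names)
      m_by_name.emplace(name, index); // first symbol to claim a name keeps it
    m_by_address.emplace(symbol.address, index);
    m_symbols.push_back(std::move(symbol));
  }

  const MultiNameSymbol *FindByName(const std::string &name) const {
    const auto it = m_by_name.find(name);
    return it == m_by_name.end() ? nullptr : &m_symbols[it->second];
  }

  const MultiNameSymbol *FindByAddress(std::uint64_t address) const {
    const auto it = m_by_address.find(address);
    return it == m_by_address.end() ? nullptr : &m_symbols[it->second];
  }

private:
  std::vector<MultiNameSymbol> m_symbols;
  std::unordered_map<std::string, std::size_t> m_by_name;
  std::unordered_map<std::uint64_t, std::size_t> m_by_address;
};

// The ELF-specific part: only this function knows about "@@".
MultiNameSymbol MakeElfSymbol(const std::string &full_name,
                              std::uint64_t address) {
  MultiNameSymbol symbol{address, {full_name}};
  const std::size_t pos = full_name.find("@@");
  if (pos != std::string::npos)
    symbol.names.push_back(full_name.substr(0, pos));
  return symbol;
}
```

Symbolification would report names[0], so the versioned name is preserved, while name lookup succeeds for either spelling and lands on the same symbol.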

As you can see, I am not exactly thrilled by any of these options. What do you think about it?

labath abandoned this revision. Mar 3 2015, 9:37 AM

Abandoning in favor of D8036.