Page MenuHomePhabricator

[LLDB][ELF] Load both, .symtab and .dynsym sections
AcceptedPublic

Authored by kwk on Tue, Sep 10, 2:38 AM.

Details

Summary

This change ensures that the .dynsym section will be parsed even when there's already is a .symtab.

It is motivated because of minidebuginfo (https://sourceware.org/gdb/current/onlinedocs/gdb/MiniDebugInfo.html#MiniDebugInfo).

There it says:

Keep all the function symbols not already in the dynamic symbol table.

That means the .symtab embedded inside the .gnu_debugdata does NOT contain the symbols from .dynsym. But in order to put a breakpoint on all symbols we need to load both. I hope this makes sense.

My other patch D66791 implements support for minidebuginfo, that's why I need this change.

Event Timeline

kwk created this revision.Tue, Sep 10, 2:38 AM
JDevlieghere added inline comments.
lldb/lit/Modules/ELF/load-from-dynsym-alone.test
25

Please also add llvm-strip to the list of support tools (toolchain.py:127).

lldb/source/Plugins/ObjectFile/ELF/ObjectFileELF.cpp
2645

Can you motivate the need for this change? This comment seems to suggest that reading the symtab table should be sufficient as it should contain all the information from the dynsym. If that is not true, it would be worth updating this comment.

2659

Why did you remove the last part of the original comment? This seemed to be the most useful part... The newly added sentences explain what we are doing (which is relatively clear from the code). I'd rather see a comment explaining "why" something needs to happen.

kwk updated this revision to Diff 219587.Tue, Sep 10, 12:45 PM
  • Added llvm-strip to the list of support tools
kwk marked an inline comment as done.EditedTue, Sep 10, 12:55 PM

@JDevlieghere I change the support tool. It was @labath who requested (D66791#inline-601050) to outsource this patch.

Let's put this bit into a separate patch. As I said over IRC, I view this bit as functionally independent of the debugdata stuff (it's definitely independent in it's current form, as it will kick in for non-debugdata files too, which I think is fine).

The change is motivated because of minidebuginfo (https://sourceware.org/gdb/current/onlinedocs/gdb/MiniDebugInfo.html#MiniDebugInfo).

There it says:

  1. Keep all the function symbols not already in the dynamic symbol
  2. table.

That means the .symtab embedded inside the .gnu_debugdata does NOT contain the symbols from .dynsym. But in order to put a breakpoint on all symbols we need to load both. I hope this makes sense.

My other patch D66791 implements support for minidebuginfo, that's why I need this change.

kwk updated this revision to Diff 219594.Tue, Sep 10, 1:21 PM
kwk marked an inline comment as done.
  • update comment for .symtab section with minidebuginfo
kwk added a comment.Tue, Sep 10, 1:22 PM

Fixed the comment as per request.

JDevlieghere accepted this revision.Tue, Sep 10, 2:07 PM

Thanks! This LGTM with the comment on line 2660 rephrased and the motivation as part of the summary/commit message.

lldb/source/Plugins/ObjectFile/ELF/ObjectFileELF.cpp
2659

The symtab section is non-allocable and can be stripped, while the dynsym section which should always be always be there. If both exist we load both to support the minidebuginfo case. Otherwise we just load the dynsym section.

This revision is now accepted and ready to land.Tue, Sep 10, 2:07 PM
kwk updated this revision to Diff 219605.Tue, Sep 10, 2:15 PM
  • Rephrase comment
kwk edited the summary of this revision. (Show Details)Tue, Sep 10, 2:16 PM
labath accepted this revision.Wed, Sep 11, 2:32 AM

lgtm. The reason I requested this to be put separately, is because it is implemented in a way that kicks in even without minidebuginfo. I think this is fine, because entries can be removed from the symtab even without the whole minidebuginfo business. While this format will likely be the only real user of these partial symtabs, in theory, there isn't anything stopping someone from removing symtab entries independently of that. While unlikely, I see no harm in supporting that, if it does not incur any extra maintenance costs.

lldb/lit/Modules/ELF/load-from-dynsym-alone.test
15

s/.dynamic/.dynsym/

lldb/source/Plugins/ObjectFile/ELF/ObjectFileELF.cpp
2648–2650

How about we make this less layered, and rephrase the existing comment a bit: "Information in the dynsym section is *usually* also found in the symtab, but this is not required as symtab entries can be removed after linking. The minidebuginfo format makes use of this facility to create smaller symbol tables.

2667–2670

I wouldn't bother with this. You can just unconditionally create a Symtab object before you start parsing any symbol tables.

kwk updated this revision to Diff 219677.Wed, Sep 11, 2:55 AM

Updated commit message and squashed commits into one rebased onto master.

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptWed, Sep 11, 3:01 AM
kwk marked 3 inline comments as done.Wed, Sep 11, 3:13 AM

@labath I've addressed your comment rewrites in a fixup commit that I've commited without a review (llvm-svn: 371600): https://reviews.llvm.org/rLLDB371600

lldb/source/Plugins/ObjectFile/ELF/ObjectFileELF.cpp
2667–2670

I don't fully agree that it is that simple because further down in the code we do check for if (m_symtab_up == nullptr) and that is a condition I need to respect because of relocation, don't I?

It broke on linux too. You did run the tests before committing, did you?

labath added inline comments.Wed, Sep 11, 6:04 AM
lldb/source/Plugins/ObjectFile/ELF/ObjectFileELF.cpp
2667–2670

Well.. I'm pretty sure you could delete those null checks too. But, given that these null checks seem to be the prevailing pattern in this function, changing that might be better left for a separate patch...

kwk added a comment.Wed, Sep 11, 10:44 PM

@stella.stamenova I've reverted this patch and thanks for letting me know.

@labath I didn't notice that tests failed, I'm really sorry.

@labath how shall we go about this? We do have the situation that when you lookup a symbol you might find it twice if it is in .dynsym and in .symtab. Shall I adjust the test expectation to that or change my implementation?

kwk reopened this revision.Sun, Sep 15, 9:17 PM
This revision is now accepted and ready to land.Sun, Sep 15, 9:17 PM
kwk updated this revision to Diff 220273.Sun, Sep 15, 9:17 PM
  • [LLDB][ELF] Fixup for comments in D67390
  • Change test expectation to find 2 instead of 1 symbol
In D67390#1667270, @kwk wrote:

@labath how shall we go about this? We do have the situation that when you lookup a symbol you might find it twice if it is in .dynsym and in .symtab. Shall I adjust the test expectation to that or change my implementation?

That's a good question (and another reason why I wanted this to be a separate patch). Since only two tests broke it does not seem like having some symbols twice does much harm. OTOH, having an identical symbol twice does seem like asking for trouble down the line. One possibility would be to restrict this merging to the gnu_debugdata case only. Another option would be to make the merging code smarter and avoid adding the symbol a second time if it has the same name and address. That would have the advantage of having the symbol just once in the common case, while still preserving the full information (in case the symbol tables were munged independently of the gnu_debugdata thingy).

Overall, I guess I would prefer the last solution (inserting only different symbols) unless that turns out to be difficult. In that case, I think just restricting this to gnu_debugdata is fine.

BTW, if you want, I think you can submit the rest of the gnu_debugdata changes without waiting for this, if you just adjust the test expectations to account for the fact that symtab+dynsym merging does not work (yet).

lldb/packages/Python/lldbsuite/test/functionalities/load_unload/TestLoadUnload.py
276

You can't do that because this will have only two matches on elf platforms. Since we don't really care about the number here. I suggest finding a string that can be grepped for, which does not depend on the number of matches (maybe just a_function).

kwk added a comment.Mon, Sep 16, 4:57 AM
In D67390#1667270, @kwk wrote:

@labath how shall we go about this? We do have the situation that when you lookup a symbol you might find it twice if it is in .dynsym and in .symtab. Shall I adjust the test expectation to that or change my implementation?

That's a good question (and another reason why I wanted this to be a separate patch). Since only two tests broke it does not seem like having some symbols twice does much harm. OTOH, having an identical symbol twice does seem like asking for trouble down the line. One possibility would be to restrict this merging to the gnu_debugdata case only. Another option would be to make the merging code smarter and avoid adding the symbol a second time if it has the same name and address. That would have the advantage of having the symbol just once in the common case, while still preserving the full information (in case the symbol tables were munged independently of the gnu_debugdata thingy).

Overall, I guess I would prefer the last solution (inserting only different symbols) unless that turns out to be difficult. In that case, I think just restricting this to gnu_debugdata is fine.

The crucial line is the following line in ObjectFileELF::ParseSymbols():

symtab->AddSymbol(dc_symbol);

If I change that to:

symtab->FindSymbolsByNameAndAddress(dc_symbol.GetName(), dc_symbol.GetAddress(), indexVector)
if (indexVector.empty()) {
  symtab->AddSymbol(dc_symbol);
}

is that the logic you have in mind? FindSymbolsByNameAndAddress I would need to implement.

BTW, if you want, I think you can submit the rest of the gnu_debugdata changes without waiting for this, if you just adjust the test expectations to account for the fact that symtab+dynsym merging does not work (yet).

Let's wait first, okay?

In D67390#1667270, @kwk wrote:

@labath how shall we go about this? We do have the situation that when you lookup a symbol you might find it twice if it is in .dynsym and in .symtab. Shall I adjust the test expectation to that or change my implementation?

That's a good question (and another reason why I wanted this to be a separate patch). Since only two tests broke it does not seem like having some symbols twice does much harm.

You should ensure that there aren't multiple symbols for the same address. Mach-o has this problem with the STAB symbols and the standard symbol table symbols. That plug-in will unique them into one using the STAB symbol table index as the lldb::user_id_t.

OTOH, having an identical symbol twice does seem like asking for trouble down the line. One possibility would be to restrict this merging to the gnu_debugdata case only. Another option would be to make the merging code smarter and avoid adding the symbol a second time if it has the same name and address. That would have the advantage of having the symbol just once in the common case, while still preserving the full information (in case the symbol tables were munged independently of the gnu_debugdata thingy).

We really want to present the simplest and best view of the module to the the user. So please do only create one symbol, no matter how many sources we have.

Overall, I guess I would prefer the last solution (inserting only different symbols) unless that turns out to be difficult. In that case, I think just restricting this to gnu_debugdata is fine.

Code to unique symbols isn't that hard to do during the symbol table parsing. Thought don't keep re-sorting the lldb_private::Symbol objects, just keep a side map that has file address to symbol name. I am guessing it will be easier than mach-o because there the different symbols have different info (STAB vs normal symbol table symbol) where as in ELF they should be the same.

BTW, if you want, I think you can submit the rest of the gnu_debugdata changes without waiting for this, if you just adjust the test expectations to account for the fact that symtab+dynsym merging does not work (yet).

Love to see this fixed prior to checkin in possible

In D67390#1671128, @kwk wrote:
In D67390#1667270, @kwk wrote:

@labath how shall we go about this? We do have the situation that when you lookup a symbol you might find it twice if it is in .dynsym and in .symtab. Shall I adjust the test expectation to that or change my implementation?

That's a good question (and another reason why I wanted this to be a separate patch). Since only two tests broke it does not seem like having some symbols twice does much harm. OTOH, having an identical symbol twice does seem like asking for trouble down the line. One possibility would be to restrict this merging to the gnu_debugdata case only. Another option would be to make the merging code smarter and avoid adding the symbol a second time if it has the same name and address. That would have the advantage of having the symbol just once in the common case, while still preserving the full information (in case the symbol tables were munged independently of the gnu_debugdata thingy).

Overall, I guess I would prefer the last solution (inserting only different symbols) unless that turns out to be difficult. In that case, I think just restricting this to gnu_debugdata is fine.

The crucial line is the following line in ObjectFileELF::ParseSymbols():

symtab->AddSymbol(dc_symbol);

If I change that to:

symtab->FindSymbolsByNameAndAddress(dc_symbol.GetName(), dc_symbol.GetAddress(), indexVector)
if (indexVector.empty()) {
  symtab->AddSymbol(dc_symbol);
}

in mach-o plug-in we keep a std::map or something easier. We don't sort or search the current lldb_private::Symbol objects since they aren't sorted, nor are the name indexes created until we are done with adding all symbols. Can we use a side map that is just something like:

std::map<lldb::addr_t, ContString> SymbolMapType;

is that the logic you have in mind? FindSymbolsByNameAndAddress I would need to implement.

BTW, if you want, I think you can submit the rest of the gnu_debugdata changes without waiting for this, if you just adjust the test expectations to account for the fact that symtab+dynsym merging does not work (yet).

Let's wait first, okay?

kwk added a comment.Mon, Sep 16, 9:17 AM

@clayborg what address is it exactly to store here std::map<lldb::addr_t, ContString> SymbolMapType;? As an example dc_symbol.GetAddress().GetFileAddress() is something that would work but as soon as we have minidebuginfo, then we might end having the same symbol coming from two different object files and so their address would still be different. Also do you want me to keep this map in ObjectFileELF?

In D67390#1671463, @kwk wrote:

@clayborg what address is it exactly to store here std::map<lldb::addr_t, ContString> SymbolMapType;? As an example dc_symbol.GetAddress().GetFileAddress() is something that would work but as soon as we have minidebuginfo, then we might end having the same symbol coming from two different object files and so their address would still be different. Also do you want me to keep this map in ObjectFileELF?

We might need a private function on ObjectFileELF that takes an extra parameter. My idea would be something like:

... ObjectFileELF::GetSymtab() {
  std::map<lldb::addr_t, ConstString SymbolMapType;
  SymbolMapType symbol_map;
  ParseSymbolTablePrivate(..., symbol_map); // .symtab
  ParseSymbolTablePrivate(..., symbol_map); // .dynsym
  ParseSymbolTablePrivate(..., symbol_map); // .other?
clayborg added a comment.EditedMon, Sep 16, 9:59 AM
In D67390#1671463, @kwk wrote:

@clayborg what address is it exactly to store here std::map<lldb::addr_t, ContString> SymbolMapType;? As an example dc_symbol.GetAddress().GetFileAddress() is something that would work but as soon as we have minidebuginfo, then we might end having the same symbol coming from two different object files and so their address would still be different. Also do you want me to keep this map in ObjectFileELF?

The file address should be sufficient for normal ELF files. When we have a minidebuginfo, we should almost to loading this as part of the ObjectFileELF that points to the minidebuginfo and parsing it as if it were part of that file. SymbolVendor used to exist to allow one view of an executable using multiple individual ObjectFile objects, but that got removed. So now it might be best to load the minidebuginfo file as an ObjectFileELF _just_ to access the data in the ".gnu_debugdata" and decompress it, and use that data to add to the symbol table of the current file? So the code would be:

ObjectFileELF::GetSymtab() {
  std::map<lldb::addr_t, ConstString SymbolMapType;
  SymbolMapType symbol_map;
  ParseSymbolTablePrivate(..., symbol_map); // .symtab from current file
  ParseSymbolTablePrivate(..., symbol_map); // .dynsym from current file
  // Detect ".gnu_debugdata" file and load it
  FileSpec gnu_debugdata_file(...);
  if (gnu_debugdata_file.Exists()) {
    ObjectfileELF gnu_debugdata(gnu_debugdata_file);
    gnu_debugdata_symtab_data = gnu_debugdata.GetSectionData();
    // Decompress above data and parse the symbol table as if it is part of this file?
   ParseSymbolTablePrivate(...); // .gnu_debugdata symtab
  }

This assumes that the .gnu_debugdata file has the same section definitions for things like .text and .data etc.

kwk added a comment.Tue, Sep 17, 2:13 AM

@clayborg thank you for this explanation. My patch for minidebuginfo is already done in D66791 . But this one here I was asked to pull out as a separate patch. For good reasons as we see. I wonder how this extra parameter SymbolMapType of yours does help. In the end I have to identify duplicates. But if no symbol with the same name should be added then why do I care about where the symbol is coming from?

Please help me understand of follow my thoughts here:

When I'm in the (lldb) console and type b foo I expect LLDB to set a breakpoint on the function foo, right? The type of the symbol foo is deduced as function. I ask this question because Symtab has no function to just search for a symbol by name; instead you always have to pass an address, a type or an ID:

Symbol *FindSymbolByID(lldb::user_id_t uid) const;
Symbol *FindSymbolWithType(lldb::SymbolType symbol_type,
size_t FindAllSymbolsWithNameAndType(ConstString name,
size_t FindAllSymbolsWithNameAndType(ConstString name,
size_t FindAllSymbolsMatchingRexExAndType(
Symbol *FindFirstSymbolWithNameAndType(ConstString name,
Symbol *FindSymbolAtFileAddress(lldb::addr_t file_addr);
Symbol *FindSymbolContainingFileAddress(lldb::addr_t file_addr);
size_t FindFunctionSymbols(ConstString name, uint32_t name_type_mask,

So my point of this whole question is: What makes a symbol unique in the sense that it shouldn't be added to the symtab if it is already there?

Shouldn't the type of the symbol together with it's name define uniqueness? We shouldn't care about where the symbol is coming from nor if it is located at a different address. Well, if there's an overloaded function foo(int) and foo(char*) then both symbols are of type function and they both share the same name. When you type b foo you DO want 2 breakpoints to be set. Hence, niqueness cannot be defined over the name and the type . But wait, the name is mangled, so it IS unique enough unless I use Symbol::GetNameNoArguments(); there only the name is returned.

Here's my naive approach to test the admittedly very weird thought process from above: https://github.com/kwk/llvm-project/commit/5da4559a00c73ebefd8f8199890bd1991c94fa3f

In D67390#1672210, @kwk wrote:

So my point of this whole question is: What makes a symbol unique in the sense that it shouldn't be added to the symtab if it is already there?

A symbol name is not unique because you can have multiple (static) functions with the same (mangled) name in one module. An address is not unique as well because you can have symbol aliases, which will have the same address (and we want to keep both names to resolve name breakpoints correctly for instance).

The name+address combination (my original suggestion) should be sufficiently unique for the purposes we care about. Theoretically, if you want, you could include some additional items in the uniqueness "key" like symbol type etc. (to rule out the perverse case of somebody setting a "file" symbol to conflict with some other function symbol), but I don't think that is really necessary.

In D67390#1672210, @kwk wrote:

@clayborg thank you for this explanation. My patch for minidebuginfo is already done in D66791 . But this one here I was asked to pull out as a separate patch. For good reasons as we see. I wonder how this extra parameter SymbolMapType of yours does help. In the end I have to identify duplicates. But if no symbol with the same name should be added then why do I care about where the symbol is coming from?

The uniqueness is for symbols with the same name and file address and size. You can have multiple symbols with the same name, and each on could have a different address. We want there to only be one symbol per ObjectFile that has the same name + addr + size. That way when we ask for symbols by name, we don't have to end up getting more than one symbol for something that is the same thing.

Please help me understand of follow my thoughts here:

When I'm in the (lldb) console and type b foo I expect LLDB to set a breakpoint on the function foo, right? The type of the symbol foo is deduced as function.

Breakpoints ask for symbols whose type is lldb::eSymbolTypeCode and that match the name. The name matching is much more complex than you would think though. "b foo" by default turns into 'set a breakpoint on functions whose "basename" matches foo'. This means a C function named 'foo', any C++ method (stand alone function or class method) whose basename is 'foo' (bar::foo(int)", "foo(int)", "foo(float)", "std::a::b::foo()", many more) and any objective C function whose selector is 'foo' ("-[MyClass foo]", "+[AnotherClass foo]", and any other basename from any other language.

If you type "b foo::bar" this will end up looking up all functions whose basename is "bar" and then making sure any found matches contain "foo::bar".

I ask this question because Symtab has no function to just search for a symbol by name; instead you always have to pass an address, a type or an ID:

Symbol *FindSymbolByID(lldb::user_id_t uid) const;
Symbol *FindSymbolWithType(lldb::SymbolType symbol_type,
size_t FindAllSymbolsWithNameAndType(ConstString name,
size_t FindAllSymbolsWithNameAndType(ConstString name,
size_t FindAllSymbolsMatchingRexExAndType(
Symbol *FindFirstSymbolWithNameAndType(ConstString name,
Symbol *FindSymbolAtFileAddress(lldb::addr_t file_addr);
Symbol *FindSymbolContainingFileAddress(lldb::addr_t file_addr);
size_t FindFunctionSymbols(ConstString name, uint32_t name_type_mask,

FindAllSymbolsWithNameAndType() takes a name. The symbol type can be lldb::eSymbolTypeAny, so yes there is way to search. So there is a way to search for a symbol by name. The main issue is we are still parsing the symbol table and we don't want to initialized the name lookup table in the symbol table just yet since we are still adding symbols to the complete list. This is why an extra map that is used only during parsing of the symbol table makes sense.

So my point of this whole question is: What makes a symbol unique in the sense that it shouldn't be added to the symtab if it is already there?

Symbol is unique within one representation of an ObjectFile where the symbol has the same name, address and size and type.

Shouldn't the type of the symbol together with it's name define uniqueness? We shouldn't care about where the symbol is coming from nor if it is located at a different address.

We don't care where the symbol comes from as long as it is representing information for one ObjectFile. I would contend that if you have an "a.out" binary that has a .gnu_debugdata that points to "a.out.gnu_debugdata" that we would have one single ObjectFile that represents "a.out" and give the best description of what "a.out" contains.

We do care if a symbol has a different address. You can have as many static functions called "foo" as you want in a single binary. They are each unique since they have different addresses. So if you have 10 source files where 3 of those sources have a symbol "foo", we want there to be 3 different symbols with the same name, different addresses and possibly different sizes.

Well, if there's an overloaded function foo(int) and foo(char*) then both symbols are of type function and they both share the same name. When you type b foo you DO want 2 breakpoints to be set.

Yes we do! One breakpoint in LLDB has N breakpoint locations. A breakpoint of any kind (source file and line, function name breakpoint, and more) all constantly adding and removing locations as shared libraries get loaded. So if you have "foo(int)" and "foo(char *)" and say "b foo" we would end up with one breakpoint whose name matches "foo" with two locations.

Hence, niqueness cannot be defined over the name and the type . But wait, the name is mangled, so it IS unique enough unless I use Symbol::GetNameNoArguments(); there only the name is returned.

Again, there can be N breakpoints with the same name and different addresses. We are just trying to avoid a symbol table that has 3 symbols all with the name "foo", address of 0x1000 and size of 0x10. Why? Because the information is redundant and is just noise. The map I suggested would track these symbols so we don't end up adding multiple of the same symbols. Again, name address and size must all match. We _can_ have multiple symbols at the same address with different names. How? Linkers often have aliased symbols that point to the same thing. if we have a symbol "foo" at 0x1000 with size 0x10, we can also have a symbol "foo_alias" at 0x1000 and size 0x10. We need both so that we can set breakpoints correctly when setting breakpoints by name.

Here's my naive approach to test the admittedly very weird thought process from above: https://github.com/kwk/llvm-project/commit/5da4559a00c73ebefd8f8199890bd1991c94fa3f

I will take a look.

Let me know if you have any questions about what I said above.

In D67390#1672210, @kwk wrote:

So my point of this whole question is: What makes a symbol unique in the sense that it shouldn't be added to the symtab if it is already there?

A symbol name is not unique because you can have multiple (static) functions with the same (mangled) name in one module. An address is not unique as well because you can have symbol aliases, which will have the same address (and we want to keep both names to resolve name breakpoints correctly for instance).

The name+address combination (my original suggestion) should be sufficiently unique for the purposes we care about. Theoretically, if you want, you could include some additional items in the uniqueness "key" like symbol type etc. (to rule out the perverse case of somebody setting a "file" symbol to conflict with some other function symbol), but I don't think that is really necessary.

We could track any extra data we need in the map if needed as Pavel suggests above. Not sure if it is needed, but we could do it if necessary.

kwk added a comment.Wed, Sep 18, 1:35 AM

@clayborg are you on IRC?