This is an archive of the discontinued LLVM Phabricator instance.

[Symbol] Search symbols with name and type in a symbol file
ClosedPublic

Authored by aleksandr.urakov on Oct 17 2018, 5:11 AM.

Details

Summary

This patch adds possibility of searching a public symbol with name and type in a symbol file, not only in a symtab. It is helpful when working with PE, because PE's symtabs contain only imported / exported symbols only. Such a search is required for e.g. evaluation of an expression that calls some function of the debuggee.

A few weeks ago on lldb-dev there was a discussion of this, it is called Symtab for PECOFF.

Diff Detail

Event Timeline

jingham added a subscriber: jingham.

This seems like the sort of thing Greg should have a look at as well.

clayborg requested changes to this revision.Oct 22 2018, 10:44 AM

All symbol tables are currently extracted from the object files via ObjectFile::GetSymtab(). Are symbols only in the PDB file? If so I would vote to add a "virtual void SymbolVendor::AddSymbols(Symtab *symtab)" and a "virtual void SymbolFile::AddSymbols(Symtab *symtab)" where we take the symbol table that comes from the object file and we can add symbols to it if the symbol file has symbols it wants to add to the object file's symbol table. All symbol queries go through the lldb_private::Symtab class anyway. Care must be taken to watch out for symbols that might already exist from an ObjectFile's symbol table to ensure don't have duplicates.

So I would:

  • Add "virtual void SymbolVendor::AddSymbols(Symtab *symtab);" to SymbolVendor that just calls through to its SymbolFile to do the work
  • Add "virtual void SymbolFile::AddSymbols(Symtab *symtab)" to SymbolFile with default implementation that does nothing
  • Override SymbolFile::AddSymbols() for SymbolFilePDB and add symbols to the provided symbol table
  • Modify *SymbolVendor::GetSymtab()" to get the object file symbol table, then pass that along to any symbol file instances it owns to allow each symbol file to augment the symbol table
  • Remove all "FindPublicSymbols()" code from patch
  • Revert all symbol searching code to just use the Symtab class now that it contains all needed symbols
This revision now requires changes to proceed.Oct 22 2018, 10:44 AM

To answer your question, PE/COFF executable symbol tables are basically
empty

Ok, I'll reimplement this, thanks!

I've implemented things as Greg have said, except for a few moments:

  • haven't added AddSymbols to SymbolVendor because it is not used anywhere, and we can just use SymbolFile::AddSymbols directly inside of SymbolVendor;
  • have made AddSymbols' parameter as a reference instead of a pointer, because there is no reason to send null to this function;
  • have cached symtab in SymbolVendor to not pass it every time through the SymbolFile::AddSymbols.

But if you don't agree with some of these I'm ready to discuss.

clayborg requested changes to this revision.Oct 24 2018, 10:34 AM

Very close. Just a bit of cleanup now that we changed SymbolFilePDB where we don't seem to need m_public_symbols anymore. See inlined comments and let me know what you think

include/lldb/lldb-forward.h
444

Do we need this anymore? See inlined comments below.

source/Plugins/SymbolFile/PDB/SymbolFilePDB.cpp
1344–1346

Maybe get the file address from "pub_symbol" and avoid converting to a SymbolSP that we will never use? See more comments below.

1353

Just make a local symbol and convert it from PDB to lldb_private::Symbol here? Something like:

symtab.AddSymbol(ConvertPDBSymbol(pdb_symbol));

So it seems we shouldn't need the m_public_symbols storage in this class anymore since "symtab" will now own the symbol we just created.

1968–1969

Maybe convert this to:

lldb_private::Symbol ConvertPDBSymbol(const llvm::pdb::PDBSymbolPublicSymbol &pub_symbol)

And we only call this if we need to create a symbol we will add to the "symtab" in SymbolFilePDB::AddSymbols(...)

source/Plugins/SymbolFile/PDB/SymbolFilePDB.h
247

Do we need this mapping anymore? We should just add the symbols to the symbol table during SymbolFilePDB::AddSymbols(...).

This revision now requires changes to proceed.Oct 24 2018, 10:34 AM
aleksandr.urakov marked 5 inline comments as done.Oct 25 2018, 2:50 AM

Ah, yes, sure! It's my mistake. I didn't pay attention to the fact that a symtab owns symbols. I'll update the patch, thanks!

source/Plugins/SymbolFile/PDB/SymbolFilePDB.cpp
1344–1346

Unfortunately there is no method of PDBSymbolPublicSymbol which allows to retrieve the file address directly. I'll calculate it as section + offset instead.

1353

The problem here is that ConvertPDBSymbol can fail e.g. if somehow pub_symbol will contain an invalid section number. We can:

  • change the interface of the function to signal about errors;
  • just assume that the section number is always valid (because we already retrieved it before during the file address calculation).

In both cases we will retrieve the section twice. We also can:

  • just pass already calculated section and offset to ConvertPDBSymbol. But it leads to a blurred responsibility and a weird interface.

So I think that it would be better just create a symbol right here - it seems that it doesn't make the code more complex. What do you think about it?

1968–1969

See the comment above

aleksandr.urakov marked 3 inline comments as done.
clayborg requested changes to this revision.Oct 25 2018, 10:42 AM

Very close, just down to making the SymbolVendor::GetSymtab() call symtab.CalculateSymbolSizes() and symtab.Finalize().

source/Plugins/SymbolFile/PDB/SymbolFilePDB.cpp
1382–1383

Seems like these two lines should be done in the symbol vendor? Maybe this function should return the number of symbols added and the symbol vendor could see if AddSymbols returns a positive number, and if so, call symtab.CalculateSymbolSizes() and symtab.Finalize(). We should also see who else is calling these and remove any calls and only do it in the SymbolVendor one time.

This revision now requires changes to proceed.Oct 25 2018, 10:42 AM

Yes, I'll implement it tomorrow (I'm already OOO now), thanks. But is it really necessary to check the number of symbols added if we must to calculate / finalize the symtab after getting it from object file anyway? May be just always do it after creation and processing by the symbol file? For each symtab it will be done just once because of caching.

Yes, I'll implement it tomorrow (I'm already OOO now), thanks. But is it really necessary to check the number of symbols added if we must to calculate / finalize the symtab after getting it from object file anyway? May be just always do it after creation and processing by the symbol file? For each symtab it will be done just once because of caching.

yes, fine to still have void and always call symtab.CalculateSymbolSizes(); and symtab.Finalize() only in the symbol vendor. Find the other places this is called in the ObjectFile plug-ins and remove them and do them once in Symbol vendor when we fetch the symtab for the first time

Ping! Can you look at this, please?

clayborg accepted this revision.Nov 1 2018, 10:23 AM

Looks good. Thanks for making the changes.

This revision is now accepted and ready to land.Nov 1 2018, 10:23 AM
This revision was automatically updated to reflect the committed changes.
davide added a subscriber: davide.Nov 2 2018, 3:00 PM

This broke MacOS. I'm going to revert this. To reproduce, just run ninja check-lldb with your patches.
Please let me know if you need other informations.

Thanks for catching that! Unfortunately, I have no access to MacOS, can you provide some more info about failure, please?

Thanks for catching that! Unfortunately, I have no access to MacOS, can you provide some more info about failure, please?

Unfortunately the bot logs are gone. When I originally looked at them they weren't particularly informative, so I'm afraid the only way would be that of trying to reproduce this thing on a real machine.

I'm not sure, but have an assumption. Here is the first green build of the green-dragon-24: http://green.lab.llvm.org/green/job/lldb-cmake/12090/ It became green after three changes, one of them is your revert of my commit, while another is Zachary's "Fix the lit test suite". I think that it's Zachary's commit fixed the build, not the revert. Moreover, my commit is Windows specific, I can't figure out, how it can break the MacOS build... So may be we will recommit it back? If it will still fail, then we could take failure logs and revert it back again.

aleksandr.urakov reopened this revision.Nov 3 2018, 2:21 PM

@davide You are right, this patch was the cause of the failure, sorry for that. It seems that I've found a generic issue with this patch. Thanks again for pointing to that!

@clayborg The problem is that there is a bunch of places where a symtab is retrieved directly from an object file, not from a symbol vendor. So it remains uncalculated. If we will just return the recalculation / finalization to object files, it will fix the issue, but symbols from PDB will not be available in this places. We can try to use the symbol vendor instead everywhere in this places (we can retrieve a module from an object file, and we can retrieve a symbol vendor from a module, so it is guaranteed that we can get the symbol vendor in all these places). What do you think about the such approach? What pitfalls can be with it?

This revision is now accepted and ready to land.Nov 3 2018, 2:21 PM

So it depends on what code was retrieving the symbol table from the object file. Can you detail where this was happening?

Yes, sure. It happens in the following functions:

  • Module::ResolveSymbolContextForAddress
  • DynamicLoaderHexagonDYLD::SetRendezvousBreakpoint through findSymbolAddress
  • DynamicLoaderHexagonDYLD::RendezvousBreakpointHit through findSymbolAddress
  • JITLoaderGDB::ReadJITDescriptorImpl
  • ObjectFileELF::GetAddressClass
  • ObjectFileELF::GetSymtab
  • ObjectFileELF::RelocateSection
  • ObjectFileELF::Dump
  • ObjectFileMachO::GetAddressClass
  • ObjectFileMachO::ProcessSegmentCommand
  • SymbolFileDWARF::GetObjCClassSymbol
  • SymbolFileDWARF::ParseVariableDIE
  • SymbolFileDWARFDebugMap::CompileUnitInfo::GetFileRangeMap
  • SymbolFileDWARFDebugMap::InitOSO
  • SymbolFileDWARFDebugMap::ResolveSymbolContext
  • SymbolFileDWARFDebugMap::FindCompleteObjCDefinitionTypeForDIE
  • SymbolFileSymtab::CalculateAbilities
  • SymbolFileSymtab::ParseCompileUnitAtIndex
  • SymbolFileSymtab::ParseCompileUnitFunctions
  • SymbolFileSymtab::ResolveSymbolContext
  • ObjectFile::GetAddressClass
  • SymbolVendor::GetSymtab()

Update the patch, move symtab finalization back to object files.

This patch makes object files and symbol files (in the case if they add symbols in a symtab) to be responsible for finalization of a symtab. It's because a symtab is used in a bunch of places, where it's undesirable to retrieve one through a symbol vendor. For example, when the object file itself uses its symtab, we can't retrieve it from the symbol vendor, because the symbol vendor is implemented in the terms of an object file, so such a solution will introduce a circular dependency, which is undesirable.

But on the other hand, if the object file uses its own symtab, then it likely doesn't rely on presence of symbols from the symbol file in that symtab. The only things it requires are symbols from the object file and finalization of the symtab.

So after this update we have the following guarantees:

  • if a symtab is retrieved from an object file, then it's consistent and guaranteed contains symbols from the object file. It may (or may not) also contain symbols from a symbol file;
  • if a symtab is retrieved from a symbol vendor, then it's consistent and guaranteed contains symbols from an object file and a symbol file.

I've taken a look at the places, where the symtab is retrieved from an object file, and it seems that the only place we need to fix due to that guarantees is the preventive usage of the symbol vendor in the Address::GetAddressClass function.

The disadvantages of the current solution are:

  • when symbols are added in the symtab both from an object file and a symbol file, the symtab is finalized twice;
  • the symtabs retrieved from different places have different guarantees.

But to solve these we need to make some other more higher-level entity (besides the object file) to own the symtab (e.g. symbol vendor) and to rewrite all the related things in object files and symbol files. The problem is that it's not trivial to make it and not to break a lot of current code.

What do you think about this approach?

aleksandr.urakov requested review of this revision.Nov 21 2018, 7:35 AM

Ping! Can you take a look, please?

clayborg accepted this revision.Nov 29 2018, 7:06 AM
This revision is now accepted and ready to land.Nov 29 2018, 7:06 AM

I've recently started looking at adding a new symbol file format (breakpad symbols). While researching the best way to achieve that, I started comparing the operation of PDB and DWARF symbol files. I noticed a very important difference there, and I think that is the cause of our problems here. In the DWARF implementation, a symbol file is an overlay on top of an object file - it takes the data contained by the object file and presents it in a more structured way.

However, that is not the case with PDB (both implementations). These take the debug information from a completely different file, which is not backed by an ObjectFile instance, and then present that. Since the SymbolFile interface requires them to be backed by an object file, they both pretend they are backed by the original EXE file, but in reality the data comes from elsewhere.

If we had an ObjectFilePDB (which not also not ideal, though in a way it is a better fit to the current lldb organization), then this could expose the PDB symtab via the existing ObjectFile interface and we could reuse the existing mechanism for merging symtabs from two object files.

I am asking this because now I am facing a choice in how to implement breakpad symbols. I could go the PDB way, and read the symbols without an intervening object file, or I could create an ObjectFileBreakpad and then (possibly) a SymbolFileBreakpad sitting on top of that.

The drawbacks of the PDB approach I see are:

  • I lose the ability to do matching of the (real) object file via symbol vendors. The PDB symbol files now basically implement their own little symbol vendors inside them, which is mostly fine if you just need to find the PDB next to the exe file. However, things could get a bit messy if you wanted to implement some more complex searching on multiple paths, or downloading them from the internet.
  • I'll hit issues when attempting to unwind (which is the real meat of the breakpad symbols), because unwind info is currently provided via the ObjectFile interface (ObjectFile::GetUnwindTable).

The drawbacks of the ObjectFile approach are:

  • more code - it needs a new ObjectFile and a new SymbolFile class (possibly also a SymbolVendor)
  • it will probably look a bit weird because Breakpad files (and PDBs) aren't really object files

I'd like to hear your thoughts on this, if you have any.

zturner added a comment.EditedNov 29 2018, 11:16 AM

I've recently started looking at adding a new symbol file format (breakpad symbols). While researching the best way to achieve that, I started comparing the operation of PDB and DWARF symbol files. I noticed a very important difference there, and I think that is the cause of our problems here. In the DWARF implementation, a symbol file is an overlay on top of an object file - it takes the data contained by the object file and presents it in a more structured way.

However, that is not the case with PDB (both implementations). These take the debug information from a completely different file, which is not backed by an ObjectFile instance, and then present that. Since the SymbolFile interface requires them to be backed by an object file, they both pretend they are backed by the original EXE file, but in reality the data comes from elsewhere.

Don't DWARF DWP files work this way as well? How is support for this implemented in LLDB?

If we had an ObjectFilePDB (which not also not ideal, though in a way it is a better fit to the current lldb organization), then this could expose the PDB symtab via the existing ObjectFile interface and we could reuse the existing mechanism for merging symtabs from two object files.

I am asking this because now I am facing a choice in how to implement breakpad symbols. I could go the PDB way, and read the symbols without an intervening object file, or I could create an ObjectFileBreakpad and then (possibly) a SymbolFileBreakpad sitting on top of that.

What if SymbolFile interface provided a new method such as GetSymtab() while ObjectFile provides a method called HasExternalSymtab(). When you call ObjectFilePECOFF::GetSymtab(), it could first check if HasExternalSymtab() is true, and if so it could call the SymbolFile plugin and return that

I've recently started looking at adding a new symbol file format (breakpad symbols). While researching the best way to achieve that, I started comparing the operation of PDB and DWARF symbol files. I noticed a very important difference there, and I think that is the cause of our problems here. In the DWARF implementation, a symbol file is an overlay on top of an object file - it takes the data contained by the object file and presents it in a more structured way.

However, that is not the case with PDB (both implementations). These take the debug information from a completely different file, which is not backed by an ObjectFile instance, and then present that. Since the SymbolFile interface requires them to be backed by an object file, they both pretend they are backed by the original EXE file, but in reality the data comes from elsewhere.

Don't DWARF DWP files work this way as well? How is support for this implemented in LLDB?

There are some similarities, but DWP is a bit different. The main difference is that the DWP file is still an ELF (or whatever) file, so we still have a ObjectFile sitting below the symbol file. The other difference is that in case of DWP we still have a significant chunk of debug information present in the main executable (mainly various offsets that need to be applied to the unlinked debug info in the dwo/dwp files), so you can still very well say that the symbol file is reading information from the main executable. What DWARF does in this case is it creates a main SymbolFileDWARF for reading data from the main object file, and then a bunch of inner SymbolFileDWARFDwo/Dwp instances which read data from the other files. There are plenty of things to not like here as well, but at least this maintains the property that each symbol file sits on top of the object file from which it reads the data from. (and symtab doesn't go into the dwp file, so there are no issues with that).

I am asking this because now I am facing a choice in how to implement breakpad symbols. I could go the PDB way, and read the symbols without an intervening object file, or I could create an ObjectFileBreakpad and then (possibly) a SymbolFileBreakpad sitting on top of that.

What if SymbolFile interface provided a new method such as GetSymtab() while ObjectFile provides a method called HasExternalSymtab(). When you call ObjectFilePECOFF::GetSymtab(), it could first check if HasExternalSymtab() is true, and if so it could call the SymbolFile plugin and return that

I don't think this would be good because there's no way for the PECOFF file to know if we will have a PDB file on top of it. If we don't find the pdb file, then the best we can do is use the list of external symbols as the symtab for the PECOFF file. I think a better way would ask the SymbolFile for the symtab. Then the symbol file can either return it's own symtab, or just forward the symtab from the object file (we already have a SymbolFileSymtab for cases when we have no debug info). That is more-or-less what this patch is doing, except that here the SymbolFile is inserting it's own symbols into the symtab created by the object file.

Great observations Pavel! I think it's really important to have
orthogonal/composable abstractions here: the symbols should be decoupled
from the container format IMO.

You know more about the ObjectFile than me so I can't say if
ObjectFileBreakpad is the best interface, but here are my initial
observations (in a somewhat random order):

Even though it doesn't sound like that, ironically, Breakpad might be now better of as an ObjectFile rather than a SymbolFile. The main three pieces of information contained in breakpad files are:

  • list of symbols
  • unwind information
  • line tables

Of these, the first two are presently vended by object files, and only the line table is done by symbol files. The auxiliary pieces of information in the breakpad files (architecture, OS, UUID), are also a property of object files in lldb.

  1. We need clear and separate abstractions for a container (ELF, PE

file, Breakpad symbols) vs. the content (debug Information).

I agree, unfortunately I think we're quite far from that now. This is complicated by the fact that different "symbol file" formats have different notion of what "symbols" are. E.g. it's obvious that Symtab ended up being vended by the object files because that's how elf+macho/dwarf do things, but that's not the case for pecoff/pdb.

  1. We need to be able to consume symbols when the corresponding module

binary is not available. This is common for postmortem debugging (ex.
having a minidump + PDBs, but not all the .DLLs or EXE files).

This is also going to be a bit tricky, though slightly orthogonal requirement. Right now things generally assume that a Module has an ObjectFile to read the data from, even though ProcessMinidump tries to work around that with special kinds of Modules. That might be enough to make addresses resolve somewhat reasonably, but I'm not sure what will happen once we start providing symbols for those kinds of modules.

  1. I'm not a fan of the PDB model, where the symbols are searched in

both the symtabs then in the symbol files. I'd rather like to see the
symtab an interface for symbols regardless of where they come from.

(Zach expressed some efficiency concerns if we'd need to build a

symtab from a PDB for example as opposed to accessing the PDB symfile
directly, although I think we can design to address this - ie. multiple
concrete symtab implementations, some of which are *internally* aware of
the source container, but w/o leaking this through the interface)

I am afraid I am a bit lost here, as I don't know much about PDBs. I'll have to study this in more detail.

  1. The symbol vendor observation is very important. Right now LLDB has

basic support for looking up DWARF symbols and none for PDBs. (for
example, IMO LLDB could greatly benefit from something like Microsoft's
symsrv - I'm actually planning to look into it soon)

(Whatever we do, this should be one of the key requirements IMO)

agreed.

On 29/11/2018 21:29, Leonard Mosescu wrote:

Hi Aleksandr, yes, no objections to this patch.

I was responding to Pavel's comments, which I also assume are
forward-looking as well, not strictly related to this patch.

Agreed, and I apologise for hijacking your review (I seem to be getting in the habit of that). I initially thought that having a ObjectFilePDB might mean that this patch would not be needed, but now I am slowly growing to like it. I think it makes sense for a symbol file to be able to provide additional symbols regardless of whether this could be done differently too.

I've recently started looking at adding a new symbol file format (breakpad symbols). While researching the best way to achieve that, I started comparing the operation of PDB and DWARF symbol files. I noticed a very important difference there, and I think that is the cause of our problems here. In the DWARF implementation, a symbol file is an overlay on top of an object file - it takes the data contained by the object file and presents it in a more structured way.

However, that is not the case with PDB (both implementations). These take the debug information from a completely different file, which is not backed by an ObjectFile instance, and then present that. Since the SymbolFile interface requires them to be backed by an object file, they both pretend they are backed by the original EXE file, but in reality the data comes from elsewhere.

Don't DWARF DWP files work this way as well? How is support for this implemented in LLDB?

There are some similarities, but DWP is a bit different. The main difference is that the DWP file is still an ELF (or whatever) file, so we still have a ObjectFile sitting below the symbol file. The other difference is that in case of DWP we still have a significant chunk of debug information present in the main executable (mainly various offsets that need to be applied to the unlinked debug info in the dwo/dwp files), so you can still very well say that the symbol file is reading information from the main executable. What DWARF does in this case is it creates a main SymbolFileDWARF for reading data from the main object file, and then a bunch of inner SymbolFileDWARFDwo/Dwp instances which read data from the other files. There are plenty of things to not like here as well, but at least this maintains the property that each symbol file sits on top of the object file from which it reads the data from. (and symtab doesn't go into the dwp file, so there are no issues with that).

I am asking this because now I am facing a choice in how to implement breakpad symbols. I could go the PDB way, and read the symbols without an intervening object file, or I could create an ObjectFileBreakpad and then (possibly) a SymbolFileBreakpad sitting on top of that.

What if SymbolFile interface provided a new method such as GetSymtab() while ObjectFile provides a method called HasExternalSymtab(). When you call ObjectFilePECOFF::GetSymtab(), it could first check if HasExternalSymtab() is true, and if so it could call the SymbolFile plugin and return that

I don't think this would be good because there's no way for the PECOFF file to know if we will have a PDB file on top of it.

I'm actually starting to wonder even if GetSymtab() should be part of ObjectFile. The first thing it does is get the Module and then start calling a bunch of stuff on the Module interface. Perhaps the place to start is comparing the Module and ObjectFile interfaces and seeing if the existing APIs make the most sense being moved up to Module. If everything was on Module then the Module has everything it needs to go to the SymbolVendor and find a PDB file.

This revision was automatically updated to reflect the committed changes.