This is an archive of the discontinued LLVM Phabricator instance.

ObjectFileBreakpad: Implement sections
ClosedPublic

Authored by labath on Dec 7 2018, 7:08 AM.

Details

Summary

This patch allows ObjectFileBreakpad to parse the contents of Breakpad
files into sections. This sounds slightly odd at first, but in essence
its not too different from how other object files handle things. For
example in elf files, the symtab section consists of a number of
"records", where each record represents a single symbol. The same is
true for breakpad's PUBLIC section, except in this case, the records will be
textual instead of binary.

To keep sections contiguous, I create a new section every time record
type changes. Normally, the breakpad processor will group all records of
the same type in one block, but the format allows them to be intermixed,
so in general, the "object file" may contain multiple sections with the
same record type.

Event Timeline

labath created this revision.Dec 7 2018, 7:08 AM
clayborg requested changes to this revision.EditedDec 7 2018, 8:51 AM

So sections should correspond to different kinds of sections, like .text, .data, etc. If we have the following breakpad file:

MODULE Linux x86_64 0000000024B5D199F0F766FFFFFF5DC30 linux.out
INFO CODE_ID 00000000B52499D1F0F766FFFFFF5DC3
FILE 0 /tmp/a.c
FUNC 1010 10 0 _start
1010 4 4 0
1014 5 5 0
1019 5 6 0
101e 2 7 0
FUNC 2010 10 0 main
2010 4 4 0
2014 5 5 0
2019 5 6 0
201e 2 7 0
PUBLIC 1010 0 _start
PUBLIC 2010 0 main
PUBLIC 3010 0 nodebuginfo1
PUBLIC 3020 0 nodebuginfo2

I would expect to have just 1 section named ".text" with read and execute permissions. This section would have its m_file_addr set to 0x1010 (from FUNC or PUBLIC with lowest address ("FUNC 1010 10 0 _start" in this case)). The size of the section would be set to the max code address + size minus the lowest code address (0x3020 - 0x1010 in this case). The file offset and file size should be set to zero since section contents are typically the bytes for the disassembly.

One other way to do this would be to create a section for each FUNC whose m_file_addr it set to the FUNC start address, and whose size is set to the last line entry - FUNC start address. Then name of the section can be set to the function name in this case. I am not a huge fan of this since it just creates extra sections for no reason and the debug info will have this info anyway so it will be duplicated.

I see no reason to create sections for MODULE, INFO, FILE, or STACK records. Was there a reason you wanted to create sections for all these? If you need the contents later, it seems like the initial parse pass of this file can easily store these items (as StringRef?) as a member variable of the ObjectFile.

Once you start parsing the debug info, the lldb::user_id_t for any items, like FUNC, can just be the line number or character offset for the FUNC source line within the file.

This revision now requires changes to proceed.Dec 7 2018, 8:51 AM

Another point of clarification is that sections exist in order to lookup addresses and resolve addresses to a section within a file. The section should be something that can easily be slid around when loaded by LLDB when we are debugging or symbolicating. So any sections we create should be able to be have the section load address set in the target with code like:

if (target.GetSectionLoadList().SetSectionLoadAddress(section_sp, section_sp->GetFileAddress() + slide))

All of the sections you added, except the FUNC section, wouldn't end up ever being loaded. All items besides FUNC might be better represented as symbols.

I guess I should elaborate more on the direction where I am going with this. I am trying to model these breakpad files as a debug-info-only object file, like something you would get by running say strip --only-keep-debug. This object file will contain a bunch of sections, but none of them will be real loadable sections. They will basically just be containers for data (DWARF, most likely). Like the .debug_*** sections, my sections also have vm_size set to 0, so there no notion of them being in memory or being slid around. The idea is that this file will never be used as the main object file for a module (*), but rather an object file that a symbol vendor uses to add symbol information to the module.

I have a follow-up patch to this (not yet ready for upload), where I create a SymbolFileBreakpad, which takes the "PUBLIC" and "FUNC" sections and uses them to add symbols into the symtab of the main object file using the interface we added for SymbolFilePDB (while doing that, I lookup these addresses in the main object file and resolve them to real sections (.text, etc.). Then, another part of that plugin would take the line information from the "FUNC" section, and convert that into a lldb_private::LineTable structure. So, basically the real action will happen in the SymbolFile plugin, and this ObjectFile is there just as a fancy container for the data.

(*) I am deliberately not handling the scenario where we have the main ObjectFile missing. We need to be able to handle the case when we cannot find the main object file regardless of whether we have the breakpad file around or not (and ProcessMinidump kind of does that right now, but I believe that should be generalized), so I am planning to have breakpad just piggy-back on that. Then if we have a real object file, we can get accurate section information from there. If not, then all of our symbols will resolve to some fake section encompassing the whole module.

Does this approach make sense?

So we do something like you describe with the DYSM files. The object file is "a.out" and it has a dSYM file "a.out.dSYM/Context/Resources/DWARF/a.out" and the dSYM file will share sections with the "a.out" object file. So if you plan on loading the breakpad file as a symbol file that just needs some sections that it can give to the debug info it will eventually create, these sections must be the same between the binary and the breakpad object file. So I would still recommend trying to make sections that make sense like a real object file. It is also nice to be able to live off of the breakpad file itself. In this case the ObjectFile for breakpad when it is stand alone should do as good of a job as possible.

If you plan on not making the breakpad file ever stand alone, then you will need to take any addresses and look them up in the module section list and use the other sections. I don't see why the breakpad file can't be stand alone though. It won't be as accurate, but it sure would be nice to be able to load a bunch of them into LLDB without needing to find the original executable and just symbolicate no?

If you plan on not making the breakpad file ever stand alone, then you will need to take any addresses and look them up in the module section list and use the other sections. I don't see why the breakpad file can't be stand alone though. It won't be as accurate, but it sure would be nice to be able to load a bunch of them into LLDB without needing to find the original executable and just symbolicate no?

I could try to make it stand-alone, but that seems to me like a duplication of effort. And since the sections I could conjure up from the breakpad info would never match the original elf file, I would have to support both cases anyway, one with using the sections from the object file, and one with the own, made-up sections.

I do intend to support both cases, but in a slightly different way. The way, I see it we have actually four cases to consider:

  1. we have an stripped elf file, and a breakpad symbol file (the case of an unstripped elf file is uninteresting, as it will have much better debug info than breakpad can possibly provide)
  2. we don't have an elf file, but we have a breakpad file
  3. we don't have an elf file nor a breakpad file

Because of case 3, we have to do some section conjuring independently of any breakpad file. We already do that to some extent, and @lemo is getting ready to extend that in D55142. Once we have that, and we find a breakpad file for this module (case 2), the breakpad file should be able to just latch onto the sections created for the placeholder object file. And I believe ProcessMinidump is in a better position to create the "placeholder" sections, as it has both access to the load addresses and sizes of the modules. From breakpad info, it would be hard to determine the size of the module if the highest address is occupied by a "PUBLIC" symbol, as they don't have any sizes associated with them.

Going back to case 1, if we have a stripped elf file, the breakpad file should latch onto the sections of that one without ever knowing the difference. So in the end, I hope this will produce a clearer code because the concerns will be separated. Breakpad code will always deal with externally-provided sections, regardless of whether they come from a "real" object file, or a made-up one. And the "making-up" code can work independently of there being a breakpad file.

If you plan on not making the breakpad file ever stand alone, then you will need to take any addresses and look them up in the module section list and use the other sections. I don't see why the breakpad file can't be stand alone though. It won't be as accurate, but it sure would be nice to be able to load a bunch of them into LLDB without needing to find the original executable and just symbolicate no?

I could try to make it stand-alone, but that seems to me like a duplication of effort. And since the sections I could conjure up from the breakpad info would never match the original elf file, I would have to support both cases anyway, one with using the sections from the object file, and one with the own, made-up sections.

I do intend to support both cases, but in a slightly different way. The way, I see it we have actually four cases to consider:

  1. we have an stripped elf file, and a breakpad symbol file (the case of an unstripped elf file is uninteresting, as it will have much better debug info than breakpad can possibly provide)
  2. we don't have an elf file, but we have a breakpad file
  3. we don't have an elf file nor a breakpad file

Because of case 3, we have to do some section conjuring independently of any breakpad file. We already do that to some extent, and @lemo is getting ready to extend that in D55142. Once we have that, and we find a breakpad file for this module (case 2), the breakpad file should be able to just latch onto the sections created for the placeholder object file. And I believe ProcessMinidump is in a better position to create the "placeholder" sections, as it has both access to the load addresses and sizes of the modules. From breakpad info, it would be hard to determine the size of the module if the highest address is occupied by a "PUBLIC" symbol, as they don't have any sizes associated with them.

Going back to case 1, if we have a stripped elf file, the breakpad file should latch onto the sections of that one without ever knowing the difference. So in the end, I hope this will produce a clearer code because the concerns will be separated. Breakpad code will always deal with externally-provided sections, regardless of whether they come from a "real" object file, or a made-up one. And the "making-up" code can work independently of there being a breakpad file.

Ok. Check out my changes that parse region info:
https://reviews.llvm.org/D55522

It parses the memory region info from the linux maps info if it is available. In breakpad generated minidumps, this will give us enough info to correctly create sections for all object files in case #3!

Ok. Check out my changes that parse region info:
https://reviews.llvm.org/D55522

It parses the memory region info from the linux maps info if it is available. In breakpad generated minidumps, this will give us enough info to correctly create sections for all object files in case #3!

Yes, that sounds like a really useful source of information.

I've uploaded D56173 to demonstrate how I intend to use the sections created here. The latter patch still requires some changes I only have locally (needed to make base address available to it), but the part about handling the sections is not affected by that.

labath requested review of this revision.Jan 3 2019, 2:34 AM

Greg, given the intended usage of these sections as demonstrated in D56173, do you agree with representing the sections of the breakpad object file in this way?

So is this done as one section per function? Or one section for contiguous functions? What about if there are only symbols? I tried to read the code but wasn't able to decipher everything clearly in my head.

Just read the original description again and now code makes sense. Main questions for me: is there a benefit to creating multiple sections? Can we just create one section and name it ".breakpad"? Should we not try to find a section that contains the address from the FUNC, line entry or PUBLIC and then avoid creating a section? That way we only create breakpad sections of there are not backing sections from a real object or symbol file?

labath added a comment.Jan 4 2019, 1:56 AM

I was under the impression I've convinced you of this direction, but your questions make it sound like you're going back to the "standalone" breakpad file idea (which I am not fond of). I'll try to explain again what I'm doing here. This is going to be somewhat repetitive (for which I apologise), but I am trying to explain this from a slightly different angle this time.

The sections I'm creating here aren't the kind of sections that will be loaded in memory. They're non-loadable sections (like the sections without SHF_ALLOC flag in elf), whose only purpose is to carry data around. Similar to how .debug_info is a non-loadable section that carries DWARF data. In this code, I'm not trying to infer anything about the layout of the described executable file from the data in the breakpad file. I am only presenting a view of the data in the breakpad file, so that this connection can happen in SymbolFileBreakpad. So, ObjectFileBreakpad will create a section whose contents will be _literally_

PUBLIC 1010 0 function1
PUBLIC 1020 0 function2
...

Then, SymbolFileBreakpad will take this section, parse it (like SymbolFileDWARF parses .debug_info), cross-reference the information with the real object, and create appropriate symbols. This happens in D56173. I am currently working on other patches which take the line records from the breakpad file and create line tables. So here, ObjectFileBreakpad will provide a "FUNC" section (because in breakpad files line records are attached to the preceding FUNC record), similar to how ObjectFileELF provides .debug_line. Then SymbolFileBreakpad parses and presents it to LLDB (like SymbolFileDWARF parses .debug_line).

In this sense, a breakpad file should be similar to a symbol-only ELF file (the kind you produce with strip --only-keep-debug) -- this one also doesn't contain any loadable sections, and is merely a container for the symbol data.

When I speak about "discontinuity" in this patch, it means discontinuity in the descriptions themselves, not in the data being described. So a breakpad file like:

FUNC 1000 10 0 function1
FILE 0 /tmp/foo.c
FUNC 1010 10 0 function2

is discontinuous because the two FUNC records are not next to each other even though the functions themselves are positioned one after the other. (I don't know why would anyone produce files like these, but the breakpad format description https://chromium.googlesource.com/breakpad/breakpad/+/master/docs/symbol_files.md explicitly allows that).

Also note that neither of these ObjectFileBreakpad nor SymbolFileBreakpad creates any loadable sections. I don't think that is necessary, as that can be done elsewhere (and better). I just use whatever sections are present in the main object file of the module. In practice this will either be a real loadable object file (elf/macho/coff), or a placeholder object file that is created when opening a minidump file. I think this makes sense for several reasons:

  • determining the limits of the loadable section from the breakpad info is hard. There will always be some loaded data (various file headers, etc.) before the first symbol described by the breakpad file. And we won't also cannot be sure of the upper limit of the section if the last symbol is a PUBLIC symbol (as they don't have size). On the other hand, creating this from the minidump info is easy, as it knows the exact ranges (coming from /proc/pid/maps) and similar.
  • better composability: having ObjectFileBreakpad be standalone will not allow us to get rid of the placeholder object files, as those will be still needed in cases when we don't have even the breakpad info. So we will need two branches in ObjectFileBreakpad (for when we have an object file vs. when we don't), and then placeholder files on top of that. Making breakpad files not be standalone let's us get rid of one of the branches in ObjectFileBreakpad
  • Overall, I think object files should be as standalone as possible. They should not infer anything based on the information in other object files or elsewhere. They should just present the data that's present in the file itself and nothing more. Combining of data from should be done at a different level.

If this still hasn't convinced you :), and you think the standalone breakpad file is the better way to go, then I'd like to understand what are the advantages you see there, because right now I don't see any. In previous comments you were worried about being able to use a breakpad file without the matching exe file. If that isn't clear from the above, then I can reiterate that I do intend to support that flow. In fact, it is my primary use case. The only difference is I intend to achieve it via a combination of placeholder object files plus symbol information from breakpad files, rather than breakpad (object) files alone.

clayborg accepted this revision.Jan 4 2019, 11:08 AM

You have convinced me! Sorry I had paged out the original intent you conveyed from before the break. Thanks for the details.

This revision is now accepted and ready to land.Jan 4 2019, 11:08 AM
labath accepted this revision.Jan 7 2019, 3:15 AM

Thank you for the review.

labath removed a reviewer: labath.Jan 7 2019, 3:15 AM

Removing the self-accept (oops).

This revision was automatically updated to reflect the committed changes.

@labath This broke lldb on Debian stable:

In file included from /build/llvm-toolchain-snapshot-8~svn350764/tools/lldb/source/Utility/DataExtractor.cpp:10:
/build/llvm-toolchain-snapshot-8~svn350764/tools/lldb/include/lldb/Utility/DataExtractor.h:1099:29: error: non-constant-expression cannot be narrowed from type 'uint64_t' (aka 'unsigned long long') to 'size_t' (aka 'unsigned int') in initializer list [-Wc++11-narrowing]
    return {GetDataStart(), GetByteSize()};
                            ^~~~~~~~~~~~~
/build/llvm-toolchain-snapshot-8~svn350764/tools/lldb/include/lldb/Utility/DataExtractor.h:1099:29: note: insert an explicit cast to silence this issue
    return {GetDataStart(), GetByteSize()};
                            ^~~~~~~~~~~~~
                            static_cast<size_t>( )

Thanks for the heads-up. This should be fixed in r350834.