Page MenuHomePhabricator

Introduce ObjectFileBreakpad

Authored by labath on Dec 3 2018, 5:27 AM.



This patch adds the scaffolding necessary for lldb to recognise symbol
files generated by breakpad. These (textual) files contain just enough
information to be able to produce a backtrace from a crash
dump. This information includes:

  • UUID, architecture and name of the module
  • line tables
  • list of symbols
  • unwind information

A minimal breakpad file could look like this:
MODULE Linux x86_64 0000000024B5D199F0F766FFFFFF5DC30 a.out
INFO CODE_ID 00000000B52499D1F0F766FFFFFF5DC3
FILE 0 /tmp/a.c
FUNC 1010 10 0 _start
1010 4 4 0
1014 5 5 0
1019 5 6 0
101e 2 7 0
PUBLIC 1010 0 _start
STACK CFI INIT 1010 10 .cfa: $rsp 8 + .ra: .cfa -8 + ^
STACK CFI 1011 $rbp: .cfa -16 + ^ .cfa: $rsp 16 +
STACK CFI 1014 .cfa: $rbp 16 +

Even though this data would normally be considered "symbol" information,
in the current lldb infrastructure it is assumed every SymbolFile object
is backed by an ObjectFile instance. So, in order to better interoperate
with the rest of the code (particularly symbol vendors).

In this patch I just parse the breakpad header, which is enough to
populate the UUID and architecture fields of the ObjectFile interface.
The rough plan for followup patches is to expose the individual parts of
the breakpad file as ObjectFile "sections", which can then be used by
other parts of the codebase (SymbolFileBreakpad ?) to vend the necessary

Event Timeline

labath created this revision.Dec 3 2018, 5:27 AM
labath added inline comments.Dec 3 2018, 5:56 AM

@lemo: Does this part make sense? It seems that on linux the breakpad files have the INFO CODE_ID section, which contains the UUID without the funny trailing zero. So I could try fetching the UUID from there instead, but only on linux, as that section is not present mac (and on windows it contains something completely different). Right now I compute the UUID on linux by chopping off the trailing zero (as I have to do that anyway for mac), but I could do something different is there's any advantage to that.

clayborg accepted this revision.Dec 3 2018, 7:29 AM

This looks like a good start.

This revision is now accepted and ready to land.Dec 3 2018, 7:29 AM

Very excited to see this work beginning!


INFO CODE_ID, if present, is a better thing to use than what you find in MODULE, except on Windows, where it’s absolutely the wrong thing to use but MODULE is fine.

So, suggested logic:

if has_code_id and not is_win:
  id = code_id
  id = module_id

Aside from special-casing Windows against using INFO CODE_ID, I don’t think you should hard-code any OS checks here. There’s no reason Mac dump_syms couldn’t emit INFO CODE_ID, even though it doesn’t currently.

(In fact, you don’t even need to special-case for Windows. You could just detect the presence of a filename token after the ID in INFO CODE_ID. As your test data shows, Windows dump_syms always puts the module filename here, as in “INFO CODE_ID 5C01672A4000 a.exe”, but other dump_syms will only have the uncorrupted debug ID.

labath updated this revision to Diff 176615.Dec 4 2018, 6:10 AM
  • implement the module_id/code_id logic suggested by Mark Mentovai
  • fix module_id endianness handling to make sure the UUID matches the one we get from the minidump files
labath marked 3 inline comments as done.
labath added inline comments.

Thanks. I've implemented the logic you suggested and fixed byte-swapping issues when parsing the module id.

Note I still have to special-case windows to strip the "age" field from the module_id in order for our UUID to match the ones we normally get on mac. (We do the same thing when opening minidump files:

lemo accepted this revision.Dec 4 2018, 2:32 PM

Looks like a great start


these magic integer literals make it hard to follow the intent - what's special about 33, 40, 8, 16, ... ? (symbolic constants might help)


Nit: I personally prefer not to mix data, type and function members in the same "access" section - is there an LLVM/LLDB guideline which requires everything in the same place?

If not, can you please add a private section for the destructor, followed by a section for the private data members?

zturner added inline comments.Dec 4 2018, 3:21 PM

This shouldn't be necessary, the top-level already recognizes .test extension. You only need a lit.local.cfg if you're changing something.


LLVM already has these functions in Triple.cpp, but they are hidden as private implementations in the CPP file. Perhaps we should expose them from headers in Triple.h.


Consider using StringRef::consumeInteger() here.


Similarly for these lines, by using consume functions everywhere we can get rid of a lot of the math and I think make the code easier to follow.


Instead of having the custom parsing functions above, how about just:

std::tie(os, line) = getToken(line);
std::tie(arch, line) = getToken(line);
llvm::Triple triple(os, "unknown", arch);
if (triple.getArch() == Unknown || triple.getOS() == Unknown)
  return llvm::None;

This way we don't even need to expose the parse functions I commented on earlier, and we can just delete them.


We have GetData() which returns an ArrayRef, and another function toStringRef which converts an ArrayRef to a StringRef. So this might be cleaner to write as auto text = llvm::toStringRef(data_sp->GetData());


Is this always true for breakpad files?


Given that we don't actually store an instance of the header anywhere, we just use it as a constructor parameter, perhaps we could go one step further and move this entire type to an anonymous namespace in the cpp file, and update the constructor to take an ArchSpec and a UUID.

I prefer to avoid nested classes wherever possible since it clutters up the interface, so hiding it to the cpp file is nice.


Shouldn't we also initialize this in SystemInitializerFull?


I would use an explicit type spelling here, but since the function is called GetObjectFile, I don't feel too strongly. It's pretty clear what the return type is.

labath updated this revision to Diff 176781.Dec 5 2018, 3:05 AM
labath marked 17 inline comments as done.

Updated according to review comments.

Also added a couple of tests for invalid inputs.

labath added inline comments.Dec 5 2018, 3:06 AM

Yes, but then lit/Modules/lit.local.cfg overrides it by specifying it's own list of suffixes. I could fix that by adding .test. to that file, or by making that file use +=, but it's not clear to me whether that is better than just being explicit here.

If you have any preference, let me know.


I've already checked out the available functions in llvm::Triple, and unfortunately each of them uses a slightly different form for some of the values. For example, getArchTypeNameForLLVMName uses x86-64 instead of x86_64, parseArch uses i386 instead of x86, parseOS uses linux instead of Linux, and so on...

Since this particular encoding is specific to the breakpad format, it made sense to me to have the parsing functions live here (as opposed to adding new cases to the Triple functions for instance), and leave everything else working with the "canonical" forms.


I've rewritten this to gradually chop bytes off from the start of the string, instead of always indexing into the original one. That should reduce the number of magic numbers (and hopefully reduce confusion).


I don't think consumeInteger can help, as these "fields" are not delimited here in any way, so that function will happily try to parse the whole string. If you had a specific patter in mind let me know (but hopefully the new implementation won't be so bad either).


Cool, I didn't know about that. Thanks.


Well.. the whole point of these files is to provide symbol information, so it would be weird if they were stripped. The breakpad dump_syms allows you to omit generating unwind information, but I don't think that's enough to call this "stripped". It is certainly possible to create a file by hand which contains just a MODULE directive and nothing else, but I would say that is a (non-stripped) file which describes an empty module, and not a stripped file.

In reality, this doesn't really matter, as this function is called from just one place, and I don't think that will be relevant for breakpad files.


Sounds good.


good point

markmentovai added inline comments.Dec 5 2018, 7:18 AM

Correct, "stripped" isn't really useful for Breakpad dump_syms output. What does LLDB do with the result of IsStripped()?

Stripped dump_syms output would be what you get from running dump_syms on a stripped module. I can't imagine why anyone would do this intentionally, but you'd also be hard-pressed to tell that's what had happened given only the dumped symbol file.

labath marked 3 inline comments as done.Dec 5 2018, 7:26 AM
labath added inline comments.

Not much. The only relevant use is linked to above. I don't fully understand that code, but my rough idea is the following: we create a "synthetic" symbol in the main object file when we know some symbol must be at the given address, but we don't know it's name. Then when we are looking up an address and it resolves to this synthetic symbol (and the object file is marked as stripped), we go to the symbol file (if we have one) to see if it can provide us with a name for it.

So this isn't even relevant for breakpad files, as they will never be the "main" object file, but I had to put something here, and "false" seems the best option.

zturner accepted this revision.Dec 6 2018, 11:59 AM
This revision was automatically updated to reflect the committed changes.
labath marked an inline comment as done.