This is an archive of the discontinued LLVM Phabricator instance.

Minidump parsing
ClosedPublic

Authored by dvlahovski on Aug 16 2016, 3:15 AM.

Details

Summary

This is a work-in-progress minidump parsing code.
There are still some more structures/data streams that need to be added.
The aim ot this is to be used in the implementation of
a minidump debugging plugin that works on all platforms/architectures.
Currently we have a windows-only plugin that uses the WinAPI to parse
the dump files.
Also added unittests for the current functionality.

Diff Detail

Repository
rL LLVM

Event Timeline

dvlahovski updated this revision to Diff 68152.Aug 16 2016, 3:15 AM
dvlahovski retitled this revision from to Minidump parsing.
dvlahovski updated this object.
dvlahovski added reviewers: labath, amccarth.
dvlahovski added a subscriber: lldb-commits.
labath edited edge metadata.Aug 16 2016, 4:18 AM

The implementation looks nice and clean. I have only some stylistic comments giving it finishing touches..

source/Plugins/Process/minidump/CMakeLists.txt
6 ↗(On Diff #68152)

Just remove the commented lines.

source/Plugins/Process/minidump/MinidumpParser.cpp
28 ↗(On Diff #68152)

What's the anticipated usage of this function? I am wondering whether it shouldn't clear the cached data (m_header, m_directory_map) as well. Or maybe, since you already have a matching contructor, just remove the function and have the user construct a new object if he needs to (?)

33 ↗(On Diff #68152)

This kind of comment should go into the header, as it speaks about the interface.

source/Plugins/Process/minidump/MinidumpParser.h
14 ↗(On Diff #68152)

(nit) if you include <cstring>, it's not a C include :)
I'd put that under C++

45 ↗(On Diff #68152)

Does the user need to know that you are returning an iterator? The map sounds like an implementation detail. If you don't have any usages for that in mind, it would be just cleaner to return the MinidumpLocationDescriptor (or maybe a reference to it) directly. Then you could remove the typedef above as well...

source/Plugins/Process/minidump/MinidumpTypes.cpp
83 ↗(On Diff #68152)

Should we abort upon encountering the first invalid thread?

I am thinking of a situation where a corrupt file claims to have UINT32_MAX threads, and we end up spinning here for quite a long time...

source/Plugins/Process/minidump/MinidumpTypes.h
35 ↗(On Diff #68152)

Let's change these into (anonymous) enums as well. We should avoid macros when possible.

120 ↗(On Diff #68152)

This declares a variable. I assume you did not want that.

unittests/Process/minidump/MinidumpParserTest.cpp
10 ↗(On Diff #68152)

I think this is only necessary when you are actually using threads.

51 ↗(On Diff #68152)

Since it's the parser who handles setting the byte order and such, maybe we should just pass the data buffer to it, and let it construct the data extractor internally. Would that work?

dvlahovski updated this revision to Diff 68177.Aug 16 2016, 7:56 AM
dvlahovski edited edge metadata.

Resolving Pavel's remarks

dvlahovski updated this revision to Diff 68180.Aug 16 2016, 8:11 AM

Fixing a little bug - should get the byteorder

after calling SignatureMatchAndSetByteOrder

zturner added inline comments.
source/Plugins/Process/minidump/MinidumpTypes.cpp
31–35 ↗(On Diff #68180)

Why would this be true? Is there a minidump specification somewhere that says that both big endian and little endian are valid byte orderings? I would expect that being a Microsoft created format, it should always be little endian, and if there is a file in big endian we can consider it corrupt.

dvlahovski added inline comments.Aug 16 2016, 9:44 AM
source/Plugins/Process/minidump/MinidumpTypes.cpp
31–35 ↗(On Diff #68180)

I didn't find anything in the MS documentation about the minidump's endianess.
I am doing this because breakpad does this too: https://chromium.googlesource.com/breakpad/breakpad/+/master/src/processor/minidump.cc#4384

amccarth edited edge metadata.Aug 16 2016, 10:00 AM

Are we putting this code in the right place? I wouldn't expect minidump parsing to fall under Plugins/Process.

I assume the eventual intent is to turn the Windows-specific code into a user of your new code? I look forward to seeing that happen.

source/Plugins/Process/minidump/MinidumpTypes.cpp
13 ↗(On Diff #68180)

Include MinidumpTypes.h first in MinidumpTypes.cpp. This helps ensure that headers can stand alone.

source/Plugins/Process/minidump/MinidumpTypes.h
14 ↗(On Diff #68180)

Maybe I'm missing it, but it doesn't seem like this header is using anything from <cstring> (which should be under C++ includes).

40 ↗(On Diff #68180)
// Reference:  https://msdn.microsoft.com/en-us/library/windows/desktop/ms680394.aspx
257 ↗(On Diff #68180)

Why do the arithmetic and static_assert? Why not use sizeof(MinidumpSystemInfo)?

LLVM does have something similar to DataExtractor, but unfortunately it's not present in the correct library for you to easily be able to reuse it. I actually own the code in question, so I'm willing to fix that for you if you're interested, but at the same time the code is very simple. LLVM has some endian-aware data types that are very convenient to work with and mostly eliminate the need to worry about endianness while parsing, which is what DataExtractor mostly does for you. To use LLVM's types, for example, you would change this:

struct MinidumpHeader
{
    uint32_t signature;
    uint32_t version;
    uint32_t streams_count;
    RVA stream_directory_rva; // offset of the stream directory
    uint32_t checksum;
    uint32_t time_date_stamp; // time_t format
    uint64_t flags;

    static bool
    SignatureMatchAndSetByteOrder(DataExtractor &data, lldb::offset_t *offset);

    static llvm::Optional<MinidumpHeader>
    Parse(const DataExtractor &data, lldb::offset_t *offset);
};

to this:

struct MinidumpHeader
{
    support::ulittle32_t signature;
    support::ulittle32_t version;
    support::ulittle32_t streams_count;
    support::ulittle32_t stream_directory_rva; // offset of the stream directory
    support::ulittle32_t checksum;
    support::ulittle32_t time_date_stamp; // time_t format
    support::ulittle64_t flags;
};

All you have to do now is reinterpret_cast the buffer to a MiniDumpHeader* and you're good to go. So pretty much the entirety of the DataExtractor class boils down to a single template function:

template<typename T>
Error consumeObject(ArrayRef<uint8_t> &Buffer, const T *&Object) {
  if (Buffer.size() < sizeof(T))
    return make_error<StringError>("Insufficient buffer!");
  Object = reinterpret_cast<const T*>(Buffer.data());
  Buffer = Buffer.drop_front(sizeof(T));
  return Error::success();
}

For starters, this is nice because it means you're not copying memory around unnecessarily. You're just pointing to the memory that's already there. With DataExtractor you are always copying bytes around. Dump files can be large (even minidumps!) and copying all this memory around is inefficient.

It also makes the syntax cleaner. You have to call different functions on DataExtractor depending on what you want to extract. GetU8 or GetU16 for example. This one function works with almost everything. A few simple template specializations and overloads can make it even more powerful. For example:

Error consumeObject(ArrayRef<uint8_t> &Buffer, StringRef &ZeroString) {
   ZeroString = StringRef(reinterpret_cast<const char *>(Buffer.front()));
   Buffer = Buffer.drop_front(ZeroString.size() + 1);
   return Error::success();
}

I have some more comments on the CL, but I have to run to a meeting, so I will be back later.

zturner added inline comments.Aug 16 2016, 11:13 AM
source/Plugins/Process/minidump/MinidumpParser.cpp
29 ↗(On Diff #68180)

I asked some people familiar with breakpad and windows history, and the answer I got was that while minidump may have support big endian prior to windows 2000 when it supported PPC and MIPS, but probably not anymore.

At this point we have full control over the minidump pipeline. If we want to generate them from within LLDB, we can write them using little endian. And if we want to read them, we can assume they are not from a file prior to windows 2000. I think it's safe to assume little endian unless we have strong evidence that we need to parse a big endian minidump.

35 ↗(On Diff #68180)

Note that this will make a copy of the data. Based on my previous email, this could be written as follows:

const MiniDumpDirectory *directory = nullptr;

for (uint32_t i=0; i < m_header->streams_count; ++i) {
  if (auto EC = consumeObject(data, directory))
    return EC;
  m_directory_map[directory.stream_type = directory->location;
}

No copying is involved here. You would need to make a few changes to do this but I think it is worth it. Same concept applies throughout the rest of this file, so I won't mention it again.

source/Plugins/Process/minidump/MinidumpParser.h
13–25 ↗(On Diff #68180)

Please include files in the following order:

  1. Files in the same directory
  1. LLDB includes
  1. LLVM includes
  1. C and C++ system includes

Although this isn't consistent with the rest of the codebase, it was recently agreed that this is what we want to move towards, so all new code should follow this pattern.

61 ↗(On Diff #68180)

Please use llvm::DenseMap. std::unordered_map should probably not be used for anything ever.

67 ↗(On Diff #68180)

Newline here.

source/Plugins/Process/minidump/MinidumpTypes.h
40 ↗(On Diff #68180)

In LLVM where we parse PDB and CodeView debug info types, we have adopted the convention that these are all enum classes, and we continue to use LLVM naming style instead of using Microsoft all caps style. So I would write this as:

enum class MiniDumpStreamType : uint32_t {
  Unused = 0,
  Reserved0 = 1,
  Reserved1 = 2,
  ThreadList = 3,
  ...
};

The : uint32_t is important since enums are signed by default on Microsoft compilers.

85 ↗(On Diff #68180)

Same goes for these, as well as a few more enums later on. I would make these all enum classes and name them using the LLVM naming conventions.

147 ↗(On Diff #68180)

Enum classes don't play nicely with flags, but LLVM has a file called llvm/ADT/BitmaskEnum.h which improves this significantly. So these can also be an enum class, just use the proper macro from that file so that bitwise operations become possible.

151 ↗(On Diff #68180)

As mentioned in my previous response, all these could be come LLVM endian aware types from llvm/Support/Endian.h' which would allow you to reinterpret_cast` straight from the underlying buffer.

165–166 ↗(On Diff #68180)

The computation here seems unnecessary to me. How about just static_assert(sizeof(MinidumpHeader) == 32);

dvlahovski edited edge metadata.

Update of the code, regarding the remarks

of zturner and amccarth.

dvlahovski added inline comments.Aug 18 2016, 10:09 AM
source/Plugins/Process/minidump/MinidumpTypes.h
258 ↗(On Diff #68557)

The idea of this is to be sure that the compiler didn't align stuff wrongly/not in the way we expect.

Are we putting this code in the right place? I wouldn't expect minidump parsing to fall under Plugins/Process.

I assume the eventual intent is to turn the Windows-specific code into a user of your new code? I look forward to seeing that happen.

My idea was that the new plugin will live in Plugins/Process/minidump.
Do you have a suggestion where the parsing code to be, if not in the same directory?

zturner added inline comments.Aug 18 2016, 10:22 AM
source/Plugins/Process/minidump/MinidumpParser.h
67 ↗(On Diff #68557)

Can this be llvm::DenseMap<MinidumpStreamType, MinidumpLocationDescriptor>?

No point erasing the type information of the enum.

source/Plugins/Process/minidump/MinidumpTypes.cpp
21 ↗(On Diff #68557)

I think these can all just return the pointer instead of llvm::Optional<>. return nullptr to indicate failure. Optionally, make the signature be something like Error MinidumpHeader::Parse(llvm::ArrayRef<uint8_t> &data, const MinidumpHeader *&Header) which would allow you to propagate the error up (if you wanted to).

At the very least though, there's no point using llvm::Optional<> if nullptr can be used to indicate failure.

30 ↗(On Diff #68557)

Where does the 0xFFFF come from?

source/Plugins/Process/minidump/MinidumpTypes.h
133 ↗(On Diff #68557)

Should this be enum class?

zturner added inline comments.Aug 18 2016, 10:25 AM
source/Plugins/Process/minidump/MinidumpTypes.h
259 ↗(On Diff #68180)

I think the static_assert is a good idea. Since this corresponds to an on disk format where the size of the structure is fixed and known, the static assert is a good idea. But the calculation is unnecessary I think. It's more readable if someone can just look at it and say "ok it has to be 32 bytes" or whatever, rather than doing this math in their head.

dvlahovski added inline comments.Aug 18 2016, 10:59 AM
source/Plugins/Process/minidump/MinidumpParser.h
67 ↗(On Diff #68557)

If I use MinidumpStreamType as the keys type then I think I have to specify the getEmptyKey() and getTombstoneKey() functions via DenseMapInfo.
And in the "unsigned" template specialisation of DenseMapInfo:

// Provide DenseMapInfo for unsigned ints.
template<> struct DenseMapInfo<unsigned> {
  static inline unsigned getEmptyKey() { return ~0U; }
  static inline unsigned getTombstoneKey() { return ~0U - 1; }
  static unsigned getHashValue(const unsigned& Val) { return Val * 37U; }
  static bool isEqual(const unsigned& LHS, const unsigned& RHS) {
    return LHS == RHS;
  }
};

So I should probably add there "special" values in the enum as well in order for it to work?

source/Plugins/Process/minidump/MinidumpTypes.h
133 ↗(On Diff #68557)

Yes, forgot to change it ...

dvlahovski added inline comments.Aug 18 2016, 11:03 AM
source/Plugins/Process/minidump/MinidumpTypes.cpp
21 ↗(On Diff #68557)

Yes, fair point. Now that I'm returning pointers, nullptr is better for indicating failure.

30 ↗(On Diff #68557)

There is a comment in the header, but probably it should be here. The higher 16 bit of the version field are implementation specific.
The lower 16 bits are the version number (which is always the same constant number)

zturner added inline comments.Aug 18 2016, 11:06 AM
source/Plugins/Process/minidump/MinidumpParser.h
67 ↗(On Diff #68557)

That's unfortunate, but it looks like you're right. It's probably not worth going to that much effort. It could probably be done by partially specializing DenseMapInfo for all enums, but I don't think it's worth it. So I suppose it's fine to leave as a uint32_t.

Also FYI I copied the implementation of consumeObject to my header.

Are we putting this code in the right place? I wouldn't expect minidump parsing to fall under Plugins/Process.

I assume the eventual intent is to turn the Windows-specific code into a user of your new code? I look forward to seeing that happen.

My idea was that the new plugin will live in Plugins/Process/minidump.
Do you have a suggestion where the parsing code to be, if not in the same directory?

The parsing code will end up being used by multiple plugins--the new one you're building for Linux and the existing one that's Windows-specific.

What I thought would happen would be that you'd write the parsing code first, hook up the Windows-minidump plugin to use it (since the Windows-minidump plugin tests would help validate that your parsing is correct/complete) and then either add a new plugin for non-Windows. That would argue for the minidump parsing to be in a common location that could be used by both plugins. I don't have a good idea of where that would go.

Another approach is to make the Windows plugin platform agnostic once you replace the limited use of the minidump APIs with your own parsing code.

Food for thought. No need to hold up this patch for that. The code can be moved later if appropriate.

source/Plugins/Process/minidump/MinidumpTypes.cpp
22 ↗(On Diff #68557)

I find it amusing that all these Parse methods are in MinidumpTypes.cpp rather than MinidumpParser.cpp. Just an observation--not a request to change it.

source/Plugins/Process/minidump/MinidumpTypes.h
133 ↗(On Diff #68557)

These look like individual bits that will be bitwise-ORed together, which is trickier to do with an enum class.

But you may still want to specify the underlying type, like uint32_t.

258 ↗(On Diff #68557)

OK.

260 ↗(On Diff #68557)

I'll concede that the static_assert is useful. Given that, showing the arithmetic explicitly will be helpful when one of these assertions trips, because you'll be able to see how it's supposed to correspond to the struct.

zturner added inline comments.Aug 18 2016, 11:10 AM
source/Plugins/Process/minidump/MinidumpTypes.h
133 ↗(On Diff #68557)

Yea, see my earlier comment. This enum uses LLVM_MARK_AS_BITMASK_ENUM() which makes this possible.

dvlahovski added inline comments.Aug 18 2016, 11:10 AM
source/Plugins/Process/minidump/MinidumpTypes.h
260 ↗(On Diff #68557)

Yes that was my inital insentive to write explicitly the arithmetic

Do you have a suggestion where the parsing code to be, if not in the same directory?

The parsing code will end up being used by multiple plugins--the new one you're building for Linux and the existing one that's Windows-specific.

I was hoping that we could avoid having a separate plugin for each OS. ProcessElfCore already handles linux and freebsd core files. It should be even simpler for minidumps, as the format there should be more similar. If that does not play out then we will need to move the generic code to a separate place, but I am hoping we can keep it in one plugin (maybe we can have a small class inside the plugin which abstracts any os-specific details we encounter).

What I thought would happen would be that you'd write the parsing code first, hook up the Windows-minidump plugin to use it (since the Windows-minidump plugin tests would help validate that your parsing is correct/complete) and then either add a new plugin for non-Windows. That would argue for the minidump parsing to be in a common location that could be used by both plugins. I don't have a good idea of where that would go.

Another approach is to make the Windows plugin platform agnostic once you replace the limited use of the minidump APIs with your own parsing code.

Food for thought. No need to hold up this patch for that. The code can be moved later if appropriate.

We'll see how it goes...

source/Plugins/Process/minidump/MinidumpParser.cpp
22 ↗(On Diff #68557)

You should check whether the data buffer contains more than sizeof(MinidumpHeader) bytes. Or you can just use m_data_sp->GetByteSize() for size, and let the parse function deal with that.

31 ↗(On Diff #68557)

Same thing. How do you know this memory exists? Is that checked somewhere?

60 ↗(On Diff #68557)

Suggestion: auto iter = ?

source/Plugins/Process/minidump/MinidumpTypes.h
260 ↗(On Diff #68557)

What I did not like about the original approach is that it declared a new global variable (in a header, so it is visible to everyone), just to store the temporary result. I think we should avoid that. I don't really care whether you write sizeof(FOO) = 47 or sizeof(FOO) = 4*10+4+3

dvlahovski updated this revision to Diff 68670.Aug 19 2016, 4:39 AM

Adding sanity checks whether the data buffer

contains the amound of bytes that we want to read.
All of the functions that were returning llvm::Optional<const something *>
now return just const something *
and indicate for failure with the return of nullptr.

I'm thinking of doing the following approach regarding the plugin implementation:
Start writing a new plugin "from scratch", which will be inspired strongly by ProcessWinMiniDump and also borrowing stuff from ProcessElfCore.
Also my plan is to make a single cross-platform Minidump plugin. At least for now I think that there aren't so many differences between Minidumps
generated on different platforms. (If the code gets messy probably a separate plugin for each platform is better.)
Also I think that I could reuse the Windows Minidump tests by pointing them to the new plugin and see if we
have at least the same (working) functionality.

Looks good as far as I am concerned. You might want to add a test or two that makes sure we handle invalid data reasonably. E.g., load only the first X kb of the minidump (so that the thread list is not present in the loaded block), and make sure getting a thread list fails (and does not crash).

source/Plugins/Process/minidump/MinidumpParser.cpp
40 ↗(On Diff #68670)

Please put this before the construction of the ArrayRef - so we don't create a dangling object for no reason.

dvlahovski updated this revision to Diff 68690.Aug 19 2016, 8:11 AM

Move creation of ArrayRef after sanity check

Add a test that checks that the thread list is not present in a
truncated minidump file

Were you still going to change all the Optionals to raw pointers (or even better, llvm::Errors)

source/Plugins/Process/minidump/MinidumpParser.cpp
21 ↗(On Diff #68690)

I'm not crazy about the idea of allowing the constructor to fail and then checking IsValidData() after you construct it. One way LLVM solves this is by having a static function that returns Expected<T>. You would use it like this:

Expected<MinidumpParser> MinidumpParser::create(const lldb::DataBufferSP &data_buf_sp) {
  if (data_buf_sp->GetByteSize() < sizeof(MinidumpHeader))
    return make_error<StringError>("Insufficient buffer!");

  MinidumpHeader *header = nullptr;
  if (auto EC = MinidumpHeader::Parse(header_data, header))
    return std::move(EC);

   ...

   return MinidumpParser(header, directory_map);
}

auto Parser = MinidumpParser::create(buffer);
if (!Parser) {
  // Handle the error
}

This way it's impossible to create a MinidumpParser that is invalid.

Up to you whether you want to do this, but I think it is a very good (and safe) pattern that I'd love to see LLDB start adopting more often.

dvlahovski updated this revision to Diff 69254.Aug 25 2016, 8:33 AM

Changed the constructing pattern of MinidumpParser

Now there is a static Create method that returns
and llvm::Optional MinidumpParser.

I changed all of the llvm::Optional<const type *> returning functions to return only const type *
and signalize for 'failure' by returning a nullptr. In the cases where I return objects (e.g. vector of threads)
I'm still using the llvm::Optional pattern.
I also think that using llvm::Error is a good idea, but I think that it'll be difficult, because of
collisions with the lldb::Error, and also at some point I have to convert objects from one type to the other.

So that's why I stick to the llvm::Optional.
I also used it on the creation of MinidumpParser (following somewhat the pattern that you proposed)

Looks fine to me. Adrian, Zachary, any more thoughts here?

source/Plugins/Process/minidump/MinidumpParser.h
42 ↗(On Diff #69254)

If you have a Create function, the constructor should probably be private (also, explicit is not necessary anymore).

dvlahovski updated this revision to Diff 69340.Aug 26 2016, 4:02 AM

Making MinidumpParser constuctor private

zturner accepted this revision.Aug 26 2016, 7:36 AM
zturner added a reviewer: zturner.
zturner added inline comments.
source/Plugins/Process/minidump/MinidumpParser.cpp
60 ↗(On Diff #69340)

You can just write return parser here. It's implicitly convertible to an llvm::Optional<>.

65 ↗(On Diff #69340)

This isn't really a performance critical class, so it doesn't matter too much, but you could std::move(directory_map) to avoid the copy.

This revision is now accepted and ready to land.Aug 26 2016, 7:36 AM
This revision was automatically updated to reflect the committed changes.
labath added inline comments.Sep 2 2016, 2:37 AM
lldb/trunk/source/Plugins/Process/minidump/MinidumpParser.cpp
146

You have a "enumeration not handled in a switch" warning here. Could you do something about that?

dvlahovski added inline comments.Sep 2 2016, 4:11 AM
lldb/trunk/source/Plugins/Process/minidump/MinidumpParser.cpp
146

Yes, I fixed it locally. Will submit it with the next CL. :)