Page MenuHomePhabricator

MinidumpParsing: pid, modules, exceptions

Authored by dvlahovski on Sep 9 2016, 5:43 AM.



Added parsing of the MiscInfo data stream.
The main member of it that we care about is the process_id
On Linux generated Minidump (from breakpad) we don't have
the MiscInfo, we have the /proc/$pid/status from where we can get the
Also parsing the module list - the list of all of the loaded
modules/shared libraries.
Finally - parsing the exception stream.

I have unit tests for all of that.
Also added some tests using a Minidump generated from Windows tools (not
from breakpad)

Diff Detail


Event Timeline

dvlahovski updated this revision to Diff 70815.Sep 9 2016, 5:43 AM
dvlahovski retitled this revision from to MinidumpParsing: pid, modules, exceptions.
dvlahovski updated this object.
dvlahovski added reviewers: labath, zturner.
dvlahovski added a subscriber: lldb-commits.
dvlahovski updated this revision to Diff 70816.Sep 9 2016, 5:47 AM

Forgot to run clang-format

dvlahovski added a comment.EditedSep 9 2016, 5:52 AM

Also added parsing code for Minidump strings - the strings in the Minidump format are UTF-16 encoded. I used the code from the WinMiniDump plugin and it can extract a UTF-16 string and convert it to a UTF-8 one.

labath edited edge metadata.Sep 9 2016, 8:04 AM

A collection of small remarks, mostly to do with style. Mainly, I'd like to make the tests more strict about what they accept as reasonable values.

A higher level question: Given that we are already in the minidump namespace, do we want all (most) of these symbols to be prefixed with Minidump ? Thoughts?

65 ↗(On Diff #70816)

this will still invoke the copy-constructor of the map. For that to work, your constructor needs to take a rvalue reference to the map.

154 ↗(On Diff #70816)

How hard would it be to actually detect the os properly here?

Since you've already started processing windows minidumps, it would be great if we could have a GetArchitecture() test for both OSs.

Update: So I see you already have that test, but it does not check the OS part of the triple. If it's not too hard for some reason, let's set that as a part of this CL.

21 ↗(On Diff #70816)

This is not consistent with the consumeObject function, which also drops the consumed bytes from the buffer.

127 ↗(On Diff #70816)

pid_t is a host-specific type. Use lldb::pid_t.

148 ↗(On Diff #70816)

This can end up adding 43 elements to the vector. It will still work, but you'll be doubling your memory usage for no reason. If you really want it, make the vector larger, but as this is not performance critical, maybe you could just use SmallVector<StringRef, 0> ?

223 ↗(On Diff #70816)

MinidumpString is just a wrapper around the std::string. Why not return the string directly? (Also the "const" there is unnecessary).

299 ↗(On Diff #70816)

this is oddly formatted now.

307 ↗(On Diff #70816)

It looks like this should be a normal method instead of a static one.

318 ↗(On Diff #70816)

It looks like this should be a normal method instead of a static one.

95 ↗(On Diff #70816)

Also check whether the returned ProcStatus object has the contents you expect?

111 ↗(On Diff #70816)

Fix the todo.

118 ↗(On Diff #70816)

Check the contents of the returned object.

138 ↗(On Diff #70816)

Check MiscInfo contents.

zturner added inline comments.Sep 9 2016, 9:20 AM
88–89 ↗(On Diff #70816)

This is fine, but I prefer to avoid the math wherever possible since it makes code slightly more difficult to read. I would write this as:

auto arr_ref = m_data_sp->GetData().drop_front(rva);
171 ↗(On Diff #70816)

Same as before, an Optional<ArrayRef> seems a bit redundant. I would just return an empty ArrayRef to represent it not being there. I dont' think there's a valid case for a strema to actually be present, but be 0 bytes in length.

202 ↗(On Diff #70816)

It will do this parsing every time you ask for the module list. How about storing a std::vector<const MinidumpModule *> in the MinidumpParser class and then you can return an ArrayRef<const MinidumpModule*> to it. So it's lazy evaluation, only done once, with the side benefit of allowing you to return an ArrayRef<T> instead of a more complicated Optional<vector<T>>

47 ↗(On Diff #70816)

There's a lot of Optional's here. Is it really possible for any subset of these to be optional? Or is it more like if you find one, you will find them all?

61 ↗(On Diff #70816)

I would drop the Optional here and return ArrayRef<const MinidumpModule*>. Returning an entire vector by value is wasteful on the stack. Using an ArrayRef makes it very lightweight, while still providing value-like semantics in that you won't be able to modify the underlyign container. Also, if its size is 0, you can treat that the same as if the Optional did not have a value.

21 ↗(On Diff #70816)

Is this logic correct? A buffer may be arbitrarily large and have more data in it besides the string. Perhaps you need to search forward for a null terminator, then only return that portion of the string, then drop that many bytes (plus the null terminator) from the input buffer?

223 ↗(On Diff #70816)

Agree, the MinidumpString class seems unnecessary to me. Just make parseMinidumpString be a file static global function and return an std::string.

299 ↗(On Diff #70816)

Probably because the comment passed 80 columns. I would put the comment on a new line above the field to fix this.

amccarth added inline comments.Sep 9 2016, 5:43 PM
21 ↗(On Diff #70816)

Minidump strings aren't zero-terminated. They're counted (in UTF16 code units). So this would have to read the count and "consume" the appropriate number of bytes.

Oh, but this isn't used for minidump strings. It looks like it's for these Linux proc status fields.

dvlahovski marked 17 inline comments as done.
dvlahovski edited edge metadata.

Changes regarding all of the comments.

Removed llvm::Optionals where it was unneeded.
Parsing also the OS from the Minidump.

Refined old and added new test cases.

dvlahovski added inline comments.Sep 12 2016, 10:08 AM
222 ↗(On Diff #71026)

Now that I return an ArrayRef<ModuleList>, which is basically a reinterpret cast of a piece of memory to a list of modules, do you think that this is still something good to do?

At least for now, I'm calling this method only once in the DoLoadCore initializing method of the plugin (still unsubmitted code) and then creating the ModuleSpecs for each module.

46 ↗(On Diff #71026)

Specifically for the string - I think it's best to return llvm::Optional<std::string> in order to distinguish between empty string and error. And there are some conditions that need to be met for the string to parsed correctly.

60 ↗(On Diff #71026)

I am returning an ArrayRef<MinidumpModule>. The Modules are consecutive in memory so this seemed like a good idea

21 ↗(On Diff #71026)

So, breakpad specific string in the Minidump files are not UTF-16 string, but just ascii strings in the file (not null terminated).
When I get the stream "LinuxProcStatus" I know how big it is and set it in the ArrayRef.
So basically it is my intention here to parse the entire ArrayRef to a string

zturner added inline comments.Sep 12 2016, 10:48 AM
222 ↗(On Diff #71026)

Yea I suppose this is fine now.

21 ↗(On Diff #71026)

In that case it seems like this function is unnecessary. If all it does is reinterpret an entire ArrayRef as a StringRef, it seems better to do that at each callsite. A generic name like consumeString makes people think of either zero terminated string or length prefixed strength, and this one is neither.

Another option might be to add an additional length argument to the consumeString function, and then call it like llvm::StringRef S = consumeString(Buffer, Buffer.size()) with a comment about why you're treating the entire thing as a string.

Just a couple of tiny comments from me.

86 ↗(On Diff #71026)

What should the function do if we call it with an out-of-bounds index? Right now it will crash in drop_front()...

76 ↗(On Diff #71026)

ASSERT_EQ(0, thread_list.size())

136 ↗(On Diff #71026)


dvlahovski updated this revision to Diff 71151.Sep 13 2016, 5:54 AM
dvlahovski marked 3 inline comments as done.

Removed consumeString; Fixed comparisons in unittests; out-of-bounds check in MinidumpString parsing

dvlahovski updated this revision to Diff 71154.Sep 13 2016, 6:16 AM

Making the LinuxProcStatus a smarter class

by not parsing the pid each time when GetPid is called

Looks good as far as I am concerned.

labath accepted this revision.Sep 13 2016, 7:36 AM
labath edited edge metadata.
This revision is now accepted and ready to land.Sep 13 2016, 7:36 AM
This revision was automatically updated to reflect the committed changes.