This is an archive of the discontinued LLVM Phabricator instance.

[lldb] Add diagnostics
ClosedPublic

Authored by JDevlieghere on Sep 30 2022, 2:05 PM.

Details

Summary

Around this time last year, I said on the mailing list [1] that I wanted to to transform the reproducers into something that resembles a sysdiagnose on Apple platforms: a collection of files containing a variety of information to help diagnose bugs or troubleshoot issues. This patch adds that framework. Based on lessons learned from the reproducers, I've intentionally tried to keep it small and simple. Different parts of LLDB can register callbacks (this is necessary for layering purposes) that will get called when the diagnostics should be generated.

For this patch, it's hooked up to Greg's statistics, but I have a bunch more things in the pipeline. With the minimal implementation in this patch, when the command line LLDB crashes, it dump the JSON statistics into a temp directory and ask the user to attach them to their bug report.

LLDB diagnostics written to /var/folders/6v/6zkjmd45211_1vxrtbnl0dj00000gn/T/diagnostics-ea34d4
Please include the directory content when filing a bug report

[1] https://lists.llvm.org/pipermail/lldb-dev/2021-September/017045.html

Diff Detail

Event Timeline

JDevlieghere created this revision.Sep 30 2022, 2:05 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 30 2022, 2:05 PM
JDevlieghere requested review of this revision.Sep 30 2022, 2:05 PM

Fix baseline

I like the direction, thanks for working on this.

To no one's surprise, I was thinking about testing :)

  • Errors related to the filesystem e.g. can't open a path, is too complicated to setup on demand. The code looks to be passing on the error, which is good enough most of the time.
  • The callbacks are added in c++, so testing if you add A then remove B and so on doesn't make sense on the same copy of lldb (and upstream will always have the same set of callbacks anyway).
  • Dumping statistics uses an existing method, you just make up the filename. So it's the statistics code's responsibility to test what that function does.

Maybe there could be a smoke test that just checks that the dir is made and has some files ending in stats.json in it? I think clang does something like this though they may have a --please-crash option too.

Unrelated - there's maybe a situation where the same set of callbacks are added in a different order, perhaps? But am I correct in thinking that this isn't an issue because these will be writing to different files? Stats has one set of files, logging has another set, etc. So the end result is always a dir of separate files one or more for each thing that registers.

lldb/include/lldb/Utility/Diagnostics.h
30

Can this be private? Code should only use Initialize if I am reading this right.

lldb/include/lldb/Utility/Log.h
231

These seem unrelated but not doing any harm.

lldb/source/API/SBDebugger.cpp
222

What is cookie used for? (or rather, used elsewhere)

lldb/source/Utility/Diagnostics.cpp
44

Is it worth adding an assert that the callback is not already in the list?

(and below, that the callback is in fact in the list)

Adding @rupprecht, as he might be interested in following this.

I don't want to get too involved in this, but the first thought on my mind is "do you really want to do all this from within a signal handler?". The fact that we're running this code means that we're already in a bad state, and its pretty hard to say how far we will manage to get without triggering another crash. At the very least, I think we should print the crash dump directory as the first order of business, before we start running random callbacks.

There are various ways to solve that, like moving the dumping code to another process, or being very careful about what you run inside the crash handler. The one thing that they all have in common is that they are harder to design/implement that what you have now. So, if you want try this out, and accept the risk that it may not be enough to capture all the crashes you're interested in, then I won't stand in your way...

lldb/include/lldb/Target/Statistics.h
180

What's the difference between this class and DebuggerStats above? e.g. if I wanted to add some new method, how would I know where to add it?

lldb/source/Utility/Diagnostics.cpp
55

I am not sure how common this is, but I have recently seen (not in lldb, but another app) a bug, which essentially caused two threads to crash at once (one SEGV, one ABRT). In those situations, you probably don't want to crash-printing routines to race with each other. You can consider putting a (global) mutex in this function, or something like that. (unless the llvm function takes care of that already).

I like the direction, thanks for working on this.

To no one's surprise, I was thinking about testing :)

  • Errors related to the filesystem e.g. can't open a path, is too complicated to setup on demand. The code looks to be passing on the error, which is good enough most of the time.
  • The callbacks are added in c++, so testing if you add A then remove B and so on doesn't make sense on the same copy of lldb (and upstream will always have the same set of callbacks anyway).
  • Dumping statistics uses an existing method, you just make up the filename. So it's the statistics code's responsibility to test what that function does.

Maybe there could be a smoke test that just checks that the dir is made and has some files ending in stats.json in it? I think clang does something like this though they may have a --please-crash option too.

I appreciate you thinking this through. Back in the days of the reproducers, we would generate a reproducer on a crash (similar to what we do here) or through a command (reproducer generate). I think we'll want to a command here too for the case where something went wrong but without crashing LLDB. I think that part should be easy to test and I was planning on adding a test as part of that.

Unrelated - there's maybe a situation where the same set of callbacks are added in a different order, perhaps? But am I correct in thinking that this isn't an issue because these will be writing to different files? Stats has one set of files, logging has another set, etc. So the end result is always a dir of separate files one or more for each thing that registers.

Yeah, in my mind all the callbacks should be orthogonal. If it wasn't for the layering, I'd make it all the responsibility of the Diagnostics class. If someone has a better idea than callbacks please let me know.

Adding @rupprecht, as he might be interested in following this.

I don't want to get too involved in this, but the first thought on my mind is "do you really want to do all this from within a signal handler?". The fact that we're running this code means that we're already in a bad state, and its pretty hard to say how far we will manage to get without triggering another crash.

We had a similar issue for the reproducers. In my experience, we mostly got away with it, but as you said, there's no guarantees. It all depends on what exactly what the callback is doing. Maybe that's something that should be controllable: e.g. some callbacks might want to do more/less based on whether they're called from a signal handler vs when they're triggered from a command.

That said, I see the crash case as a "best effort" kind of thing.

At the very least, I think we should print the crash dump directory as the first order of business, before we start running random callbacks.

That's a good point.

There are various ways to solve that, like moving the dumping code to another process, or being very careful about what you run inside the crash handler. The one thing that they all have in common is that they are harder to design/implement that what you have now. So, if you want try this out, and accept the risk that it may not be enough to capture all the crashes you're interested in, then I won't stand in your way...

Yup that's the approach we took for the reproducer. If you remember, we found that copying a bunch of files wasn't the best thing to do in a signal handler (surprise surprise) so we moved that work out of process (but still wrote out the mapping, similar to clang). I think we can take a similar approach here.

I think that part should be easy to test and I was planning on adding a test as part of that.

Sounds good to me. Nothing needed in this change then.

A command also gives you a chance to get the diagnostics out if running until the actual crash fails to dump them, potentially due to the signal handler issues Pavel is referring to.

Yeah, in my mind all the callbacks should be orthogonal. If it wasn't for the layering, I'd make it all the responsibility of the Diagnostics class. If someone has a better idea than callbacks please let me know.

I do like that they keep the Diagnostics class free of details of the others. Which means you just extend the existing tests for those other classes each time you add a dump function to them.

JDevlieghere marked 4 inline comments as done.Oct 10 2022, 8:52 AM
JDevlieghere added inline comments.
lldb/include/lldb/Utility/Log.h
231

This was unintentional, probably my editor removing trailing whitespace and clang-format kicking in because the line got changed.

lldb/source/API/SBDebugger.cpp
222

It's like the baton we sometimes pass around, just a way to pass additional data to your signal handler.

lldb/source/Utility/Diagnostics.cpp
55

LLVM uses an atomic flag to make sure an exception handler is only run once.

JDevlieghere marked 3 inline comments as done.

Address code review feedback

Changed my mind (again). I made the Callback a std::function so we can use lambas. I used function pointers originally so we could easily compare and remove callbacks, but there's too many places where this overcomplicates things.

clayborg added inline comments.Oct 10 2022, 5:27 PM
lldb/source/Target/Statistics.cpp
293

Should this return a Expected<std::vector<std::string>> in case the caller wants to know which files were created by this call?

305

What happens if the file already exists here? Will this return an error? I am guessing we are passing in a temp directory for the logs

309

There are many syscalls and OS APIs that aren't supposed to be called during a interrupt handler, and this function might call many of them. Deadlocks are a serious concern here as well since any thread might have a mutex locked that could stop the statistics reporting in its tracks since the program was asynchronously interrupted. If this does happen this process will deadlock forever and not be able to respond. We might need a bool parameter do "ReportStatistics" like "bool during_interrupt" and if this is true, we need to try and lock any module mutexes and if they fail, don't report that module. So the entire ReportStatistics function will need to be checked for potential deadlocks, mostly the Module mutex.

DavidSpickett added inline comments.Oct 11 2022, 1:30 AM
lldb/source/Utility/Diagnostics.cpp
44

Is this still relevant?

JDevlieghere marked 2 inline comments as done.Oct 12 2022, 1:29 PM
JDevlieghere added inline comments.
lldb/source/Target/Statistics.cpp
305

The file will get overwritten. My reasoning was that maybe you generated the statistics and then you were like oh wait an error showed up after I did a step, let me regenerate them. Alternatives are:

  • Erroring out if the file exists
  • Erroring out if any file exists in the diagnostic dir

I prefer that because we can check it once and give an actionable error to the user. Let me know what you think.

309

I thought almost everything in the statistics was computed upfront, but if that's not the case then we need to be careful. I picked the statistics as an example of some data that can be emitted here. I don't want to block the broader infrastructure on that so I'll hoist it into a separate patch, similar to the ones for the logs.

lldb/source/Utility/Diagnostics.cpp
44

Not anymore, you cannot compare std::functions.

JDevlieghere marked 2 inline comments as done.

Split off the statistics

clayborg requested changes to this revision.Oct 25 2022, 4:33 PM
clayborg added inline comments.
lldb/lldb/include/lldb/Utility/Diagnostics.h
1 ↗(On Diff #469399)

There are two of these files:

lldb/include/lldb/Utility/Diagnostics.h
lldb/lldb/include/lldb/Utility/Diagnostics.h

Remove this one I assume?

lldb/lldb/source/Utility/Diagnostics.cpp
1 ↗(On Diff #469399)

Is "lldb/lldb/source/Utility/Diagnostics.cpp" the right directory? Seems wrong

lldb/source/Utility/Diagnostics.cpp
2

Ah, this is the right lldb/source/Utility/Diagnostics.cpp file. Remove the previous lldb/lldb/source/Utility/Diagnostics.cpp incorrect pathed file.

This revision now requires changes to proceed.Oct 25 2022, 4:33 PM

Remove spurious lldb subdir

This revision was not accepted when it landed; it landed in state Needs Review.Oct 31 2022, 2:40 PM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptOct 31 2022, 2:40 PM