Page MenuHomePhabricator

[Support] Possibly use exception handler in the Crash Recovery Context in the same way as global exceptions
ClosedPublic

Authored by aganea on Nov 21 2019, 1:56 PM.

Details

Summary

This is a support patch for D69825.
It handles an exception in the same way as the global exception handler (same side-effect, print call stack & cleanup), and can return the exception code.
It enables CrashRecoveryContext exceptions by default (instead of disabled before) - however the exception handlers are lazily installed on the first creation of a CrashRecoveryContext.

Diff Detail

Event Timeline

aganea created this revision.Nov 21 2019, 1:56 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald Transcript
hans added a comment.Nov 26 2019, 6:34 AM

This is a support patch for D69825.

Thanks for breaking this out!

It handles an exception in the same way as the global exception handler (same side-effect, print call stack & cleanup), and can return the exception code.

I'm not familiar with this code, so I was struggling with the patch title: "Possibly use exception handler in the Crash Recovery Context in the same way as global exceptions".

Do I understand correctly that the main point is to get a stack trace when CrashRecoveryContext::RunSafely() fails, instead of just returning an error?

It enables CrashRecoveryContext exceptions by default (instead of disabled before) - however the exception handlers are lazily installed on the first creation of a CrashRecoveryContext.

Why does it now need to be enabled by default?

llvm/include/llvm/Support/CrashRecoveryContext.h
26

I guess this needs an update, if it's now being enabled by default?

104

ultra nit: period at the end would be nice :)

llvm/include/llvm/Support/Signals.h
116

Any reason not to use "int" for the return type here? "signed" is unusual as a type in LLVM I think.

llvm/lib/Support/CrashRecoveryContext.cpp
89

" = true" would be more common in LLVM code.

Do I understand correctly that the main point is to get a stack trace when CrashRecoveryContext::RunSafely() fails, instead of just returning an error?

Correct. Otherwise, when running with D69825, the callstack was not displayed anymore when a crash occurred in clang-cl. This side-effect is that it also calls the cleanup, thus the change in clang/test/Modules/signal.m (the temp files are properly deleted in the case of a crash).

It enables CrashRecoveryContext exceptions by default (instead of disabled before) - however the exception handlers are lazily installed on the first creation of a CrashRecoveryContext.

Why does it now need to be enabled by default?

I used to call CrashRecoveryContext::Enable() in CC1Command, just before calling the CC1 callback. And to be consistent, call CrashRecoveryContext::Disable() after the CC1 callback.
But then, if we're running several compilations in parallel, it gets complicated, we can't just toggle it on/off all the time - it would need a refcount as the comment in CrashRecoveryContext::Enable() says.
I thought it'd be easier to just have it always enabled, as the signal handlers won't be installed before usage? Do you prefer that, or the refcounting?

hans added a comment.Nov 29 2019, 4:48 AM

Sorry for the slow response. This is a pretty complicated patch, so I needed some time to wrap my head around it. I'm still not entirely sure I got it right, but here are some high-level comments.

  • About making CrashRecoveryContext::Enable() the default, that seems somewhat orthogonal to this change. You mention the use case of running several compilations in parallel, but I don't think that's a scenario that happens today? If doing this is necessary, I think it would be good to break it out into a separate patch.
  • The main purpose of the patch is the new "EnableExceptionHandler" mode for CrashRecoveryContext. I find the name a bit confusing, because on Posix there's no exception handler, and on Win32 an exception handler is used to implement CrashRecoverContext::RunSafely, but that's a different one.

    What the mode really does is print a stack trace, write a minidump (on windows), and run cleanups. Perhaps a better name for that would be "DumpStackAndCleanupOnFailure"? Maybe they could be separate flags even.
  • The mechanism for doing the stack dump and cleanups looks a bit scary to me. That's sys::InvokeExceptionHandler, a name which doesn't really make sense on Posix where it calls SignalHandler, and passes in the RetCode argument as the signal parameter. On Win32 it calls LLVMUnhandledExceptionFilter, and it does this inside the SEH filter function, although I think it will always return EXCEPTION_EXECUTE_HANDLER, so it could just be called inside the exception handler instead?

    On Posix, the call to SignalHandler looks scary, because it does a bunch of signal stuff, and e.g. on S390 hosts it raises the signal again to make sure to crash (which I don't think we want inside a CrashRecoveryContext).

    I wish the stack dumping and cleanup could be factored out into something that doesn't involve all the signal stuff, and that could have a clean interface that CrashRecoveryContext could simply call when it catches a crash. Maybe sys::DumpStackAndCleanup(...). Do you think this would be possible?

Thanks for analyzing this!

  • About making CrashRecoveryContext::Enable() the default, that seems somewhat orthogonal to this change. You mention the use case of running several compilations in parallel, but I don't think that's a scenario that happens today? If doing this is necessary, I think it would be good to break it out into a separate patch.

Even for non-parallel compilations, we still need to enable the CrashRecoveryContext inside the new CC1Command, just before calling RunSafely().
If you look at a previous diff of D69825, in clang/lib/Driver/Job.cpp, L389 there is:

static bool CRCEnabled{};
if (!CRCEnabled) {
  llvm::CrashRecoveryContext::Enable();
  CRCEnabled = true;
}

Which I removed, because I find it a bit awkward to "enable" something that should be decided internally by CrashRecoveryContext (this Enable API sounds more like internal behaviour leaking outside of CrashRecoveryContext).
The purpose of Enable/Disable is either for debugging (when enabling LIBCLANG_DISABLE_CRASH_RECOVERY) or possibly bubbling up the crash, as described in D23662.
Also consider the scenario where we sequentially compile several TUs: clang-cl a.cpp b.cpp c.cpp. Only the first call to CC1Command::Execute() should call llvm::CrashRecoveryContext::Enable().
It is orthogonal, I can send a separate patch. Should we instead call llvm::CrashRecoveryContext::Enable() in D69825, when clang.exe starts?

  • The main purpose of the patch is the new "EnableExceptionHandler" mode for CrashRecoveryContext. I find the name a bit confusing, because on Posix there's no exception handler, and on Win32 an exception handler is used to implement CrashRecoverContext::RunSafely, but that's a different one.

    What the mode really does is print a stack trace, write a minidump (on windows), and run cleanups. Perhaps a better name for that would be "DumpStackAndCleanupOnFailure"? Maybe they could be separate flags even.

Will do.

  • The mechanism for doing the stack dump and cleanups looks a bit scary to me. That's sys::InvokeExceptionHandler, a name which doesn't really make sense on Posix where it calls SignalHandler, and passes in the RetCode argument as the signal parameter. On Win32 it calls LLVMUnhandledExceptionFilter, and it does this inside the SEH filter function, although I think it will always return EXCEPTION_EXECUTE_HANDLER, so it could just be called inside the exception handler instead?

    On Posix, the call to SignalHandler looks scary, because it does a bunch of signal stuff, and e.g. on S390 hosts it raises the signal again to make sure to crash (which I don't think we want inside a CrashRecoveryContext).

    I wish the stack dumping and cleanup could be factored out into something that doesn't involve all the signal stuff, and that could have a clean interface that CrashRecoveryContext could simply call when it catches a crash. Maybe sys::DumpStackAndCleanup(...). Do you think this would be possible?

I'll make it possible :-)

hans added a comment.Nov 29 2019, 6:19 AM

Thanks for analyzing this!

  • About making CrashRecoveryContext::Enable() the default, that seems somewhat orthogonal to this change. You mention the use case of running several compilations in parallel, but I don't think that's a scenario that happens today? If doing this is necessary, I think it would be good to break it out into a separate patch.

Even for non-parallel compilations, we still need to enable the CrashRecoveryContext inside the new CC1Command, just before calling RunSafely().
If you look at a previous diff of D69825, in clang/lib/Driver/Job.cpp, L389 there is:

static bool CRCEnabled{};
if (!CRCEnabled) {
  llvm::CrashRecoveryContext::Enable();
  CRCEnabled = true;
}

Which I removed, because I find it a bit awkward to "enable" something that should be decided internally by CrashRecoveryContext (this Enable API sounds more like internal behaviour leaking outside of CrashRecoveryContext).
The purpose of Enable/Disable is either for debugging (when enabling LIBCLANG_DISABLE_CRASH_RECOVERY) or possibly bubbling up the crash, as described in D23662.
Also consider the scenario where we sequentially compile several TUs: clang-cl a.cpp b.cpp c.cpp. Only the first call to CC1Command::Execute() should call llvm::CrashRecoveryContext::Enable().
It is orthogonal, I can send a separate patch. Should we instead call llvm::CrashRecoveryContext::Enable() in D69825, when clang.exe starts?

Yes, maybe enabling it on startup is the best way to do it.

jkorous added a subscriber: jkorous.Dec 3 2019, 4:17 PM

This is really cool; we've wanted this for a long time.

One concern I have is that I think this will interfere (effectively disable) automatic OS handling of these crashes, which means they won't be collated and reported anymore. Can we make this configurable somehow? (E.g., leave behind an -ffork-cc1 -fno-fork-cc1 and a CMake flag to pick? Or just the CMake flag?)

(I also have a couple of minor nits inline.)

llvm/lib/Support/CrashRecoveryContext.cpp
100

Can you move the nullptr default assignments back to the header, so it's still obvious how they are set up?

llvm/lib/Support/Unix/Signals.inc
348

I'm not sure I've seen signed in code before. Might it be simpler to just say int for the return type? Or is that somehow incorrect/confusing?

Can we make this configurable somehow? (E.g., leave behind an -ffork-cc1 -fno-fork-cc1 and a CMake flag to pick? Or just the CMake flag?)

(I'm not sure we'd want to expose this on the driver command-line. But making it just a CMake flag is awkward because then you can't run tests for both without configuring twice. Another option that would keep both paths tested is to have an environment variable such as CLANG_FORK_CC1={0,1}, with a CMake flag to pick the default behaviour.)

Sorry, you can ignore all of my high-level comments, I thought I was commenting on https://reviews.llvm.org/D69825.

aganea updated this revision to Diff 232854.EditedDec 9 2019, 7:36 AM
aganea marked 8 inline comments as done.
aganea added a subscriber: vsk.

Addressed comments, and:

  • Removed CrashRecoveryContext::HandleCrash() and #pragma clang handle_crash which wasn't covered by any test and most likely deprecated by #pragma clang crash
  • Added refcounting to CrashRecoveryContext::Enable/Disable as well as lazy installation of the exception handlers in CrashRecoveryContext.
  • Partially reverted rG4624e83ce7b124545b55e45ba13f2d900ed65654/llvm/lib/Support/Unix/Signals.inc to simplify checks for IntSigs (and to fix the issue mentioned here). @vsk

All tests pass on Windows/MSVC2019/Clang9/10 and Linux/Clang9/GCC9.1.

NB: If desired, I could split this down further into smaller patches (before commit?), to simplify bisection if anything goes bad later on.

llvm/include/llvm/Support/CrashRecoveryContext.h
26

I've reverted to the previous behavior, as described in the comment.

hans added inline comments.Dec 11 2019, 6:16 AM
clang/lib/Lex/Pragma.cpp
1108 ↗(On Diff #232854)

This was added in http://llvm.org/r111449 and looks like it was just a debugging aid. Perhaps you could just remove it in a separate change (that is, without waiting for the rest of this change).

llvm/include/llvm/Support/CrashRecoveryContext.h
105

The exception terminology only really applies on Windows. Maybe "Selects whether handling of failures should be done in the same way as for regular crashes. When this is active, ...".

llvm/include/llvm/Support/Signals.h
115

What is ExceptContext and what does CleanupOnSignal do with it?

llvm/lib/Support/CrashRecoveryContext.cpp
65–69

It would be good to have a comment that explains the RetCode and ExceptContext parameters.

98

Style nit: function names should start with lower-case letter. (The current code is not very consistent, but at least for new names we should try.)

Also, could this just just be part of (un)installExceptionOrSignalHandlers()? It Seems like it's essentially just doing some extra checks around calls to those, and it seems unfortunate to have both "install()" and "installExceptionOrSignalHandlers()", but one should not be called directly etc.

And on a higher level, it worries me that the set of possible states (enabled/disabled, installed/uninstalled) is so large. I guess this is because CrashRecoveryContexts can be nested and enabled individually?

Could we use asserts to make it clearer what states should be possible? For example, in Install() you do "if (gCrashRecoveryEnabledCount <= 0) return". Why would we even hit this code if there are no enabled contexts? Etc.

140

Don't we need to install it here?

191

I guess the "inline" is not necessary?

llvm/lib/Support/Unix/Signals.inc
212

This seems unrelated? Could it be done in a separate patch?

358

Isn't this redundant with the llvm::is_contained check above, after which we return?

llvm/unittests/Support/CrashRecoveryTest.cpp
108

Isn't this a pretty surprising API, that after calling Disable(), it's still enabled?

hans added a comment.Dec 18 2019, 4:15 AM

I looked over this again, and also studied CrashRecoveryContext some more.

I don't really understand why this patch needs to modify the code for how the CRC is enabled and installed, etc.

I thought all we need for in-process-cc1 is to add the DumpStackAndCleanupOnFailure flag and behavior, nothing more.

Do you think this is possible, or am I missing something?

I looked over this again, and also studied CrashRecoveryContext some more.

I don't really understand why this patch needs to modify the code for how the CRC is enabled and installed, etc.

I thought all we need for in-process-cc1 is to add the DumpStackAndCleanupOnFailure flag and behavior, nothing more.

Do you think this is possible, or am I missing something?

Hi Hans,
Sorry for taking time to respond.
The way Enable()/Disable() is currently implemented will not work when the tool executes jobs in parallel (ie. llvm-buildozer I presented at LLVM conference; or our re-implementation of /MP which I haven't published yet). It needs refcounting, otherwise one instance might disable the CRC while other threads are running, which effectively disables the crash handlers.
But we can discuss that in a separate review if you prefer. I'll remove it from this patch unless you say otherwise.

hans added a comment.Dec 18 2019, 6:40 AM

The way Enable()/Disable() is currently implemented will not work when the tool executes jobs in parallel (ie. llvm-buildozer I presented at LLVM conference; or our re-implementation of /MP which I haven't published yet). It needs refcounting, otherwise one instance might disable the CRC while other threads are running, which effectively disables the crash handlers.
But we can discuss that in a separate review if you prefer. I'll remove it from this patch unless you say otherwise.

Yes, I think it would be good to deal with that separately, to allow for incremental progress.

hans added a comment.Jan 9 2020, 12:51 AM

The way Enable()/Disable() is currently implemented will not work when the tool executes jobs in parallel (ie. llvm-buildozer I presented at LLVM conference; or our re-implementation of /MP which I haven't published yet). It needs refcounting, otherwise one instance might disable the CRC while other threads are running, which effectively disables the crash handlers.
But we can discuss that in a separate review if you prefer. I'll remove it from this patch unless you say otherwise.

Yes, I think it would be good to deal with that separately, to allow for incremental progress.

Did you have a chance to look into this any more?

aganea updated this revision to Diff 237038.Jan 9 2020, 5:46 AM
aganea marked 16 inline comments as done.

Changes as suggested by @hans.

llvm/include/llvm/Support/Signals.h
115

Improved comment.

llvm/lib/Support/CrashRecoveryContext.cpp
98

Reverted this part.

140

Reverted this part.

llvm/lib/Support/Unix/Signals.inc
212

Reverted this part.

358

Above it is checking for "Info" Sigs, whereas here it is "Int" Sigs.

llvm/unittests/Support/CrashRecoveryTest.cpp
108

Reverted this part.

hans accepted this revision.Jan 10 2020, 7:58 AM

Looks good to me. Thanks for pushing this forward!

(Only one very minor comment.)

llvm/unittests/Support/CrashRecoveryTest.cpp
72

Is there a purpose to passing the pointer her, or could it just use incrementGlobal? If we can't, maybe call it WithPointer or something instead (or use an overload) - I was confused by the Cookie part until I realize you just mean it takes some argument.

This revision is now accepted and ready to land.Jan 10 2020, 7:58 AM
aganea marked an inline comment as done.Jan 10 2020, 8:13 AM

Thanks for taking the time @hans!

llvm/unittests/Support/CrashRecoveryTest.cpp
72

The cookie thing is because:

using SignalHandlerCallback = void (*)(void *);

/// Add a function to be called when an abort/kill signal is delivered to the
/// process. The handler can have a cookie passed to it to identify what
/// instance of the handler it is.
void AddSignalHandler(SignalHandlerCallback FnPtr, void *Cookie);

Sure, I'll change it to an overload.

This revision was automatically updated to reflect the committed changes.
mstorsjo added inline comments.
llvm/lib/Support/CrashRecoveryContext.cpp
269

This implicitly converts a pointer to uintptr_t, which is an error by default in mingw mode; a cast is needed.

aganea marked 2 inline comments as done.Jan 12 2020, 11:46 AM
aganea added inline comments.
llvm/lib/Support/CrashRecoveryContext.cpp
269