This is a support patch for D69825.
It handles an exception in the same way as the global exception handler (same side-effect, print call stack & cleanup), and can return the exception code.
It enables CrashRecoveryContext exceptions by default (instead of disabled before) - however the exception handlers are lazily installed on the first creation of a CrashRecoveryContext.
Details
Diff Detail
Event Timeline
This is a support patch for D69825.
Thanks for breaking this out!
It handles an exception in the same way as the global exception handler (same side-effect, print call stack & cleanup), and can return the exception code.
I'm not familiar with this code, so I was struggling with the patch title: "Possibly use exception handler in the Crash Recovery Context in the same way as global exceptions".
Do I understand correctly that the main point is to get a stack trace when CrashRecoveryContext::RunSafely() fails, instead of just returning an error?
It enables CrashRecoveryContext exceptions by default (instead of disabled before) - however the exception handlers are lazily installed on the first creation of a CrashRecoveryContext.
Why does it now need to be enabled by default?
llvm/include/llvm/Support/CrashRecoveryContext.h | ||
---|---|---|
26 | I guess this needs an update, if it's now being enabled by default? | |
104 | ultra nit: period at the end would be nice :) | |
llvm/include/llvm/Support/Signals.h | ||
116 | Any reason not to use "int" for the return type here? "signed" is unusual as a type in LLVM I think. | |
llvm/lib/Support/CrashRecoveryContext.cpp | ||
82 | " = true" would be more common in LLVM code. |
Correct. Otherwise, when running with D69825, the callstack was not displayed anymore when a crash occurred in clang-cl. This side-effect is that it also calls the cleanup, thus the change in clang/test/Modules/signal.m (the temp files are properly deleted in the case of a crash).
It enables CrashRecoveryContext exceptions by default (instead of disabled before) - however the exception handlers are lazily installed on the first creation of a CrashRecoveryContext.
Why does it now need to be enabled by default?
I used to call CrashRecoveryContext::Enable() in CC1Command, just before calling the CC1 callback. And to be consistent, call CrashRecoveryContext::Disable() after the CC1 callback.
But then, if we're running several compilations in parallel, it gets complicated, we can't just toggle it on/off all the time - it would need a refcount as the comment in CrashRecoveryContext::Enable() says.
I thought it'd be easier to just have it always enabled, as the signal handlers won't be installed before usage? Do you prefer that, or the refcounting?
Sorry for the slow response. This is a pretty complicated patch, so I needed some time to wrap my head around it. I'm still not entirely sure I got it right, but here are some high-level comments.
- About making CrashRecoveryContext::Enable() the default, that seems somewhat orthogonal to this change. You mention the use case of running several compilations in parallel, but I don't think that's a scenario that happens today? If doing this is necessary, I think it would be good to break it out into a separate patch.
- The main purpose of the patch is the new "EnableExceptionHandler" mode for CrashRecoveryContext. I find the name a bit confusing, because on Posix there's no exception handler, and on Win32 an exception handler is used to implement CrashRecoverContext::RunSafely, but that's a different one.
What the mode really does is print a stack trace, write a minidump (on windows), and run cleanups. Perhaps a better name for that would be "DumpStackAndCleanupOnFailure"? Maybe they could be separate flags even.
- The mechanism for doing the stack dump and cleanups looks a bit scary to me. That's sys::InvokeExceptionHandler, a name which doesn't really make sense on Posix where it calls SignalHandler, and passes in the RetCode argument as the signal parameter. On Win32 it calls LLVMUnhandledExceptionFilter, and it does this inside the SEH filter function, although I think it will always return EXCEPTION_EXECUTE_HANDLER, so it could just be called inside the exception handler instead?
On Posix, the call to SignalHandler looks scary, because it does a bunch of signal stuff, and e.g. on S390 hosts it raises the signal again to make sure to crash (which I don't think we want inside a CrashRecoveryContext).
I wish the stack dumping and cleanup could be factored out into something that doesn't involve all the signal stuff, and that could have a clean interface that CrashRecoveryContext could simply call when it catches a crash. Maybe sys::DumpStackAndCleanup(...). Do you think this would be possible?
Thanks for analyzing this!
Even for non-parallel compilations, we still need to enable the CrashRecoveryContext inside the new CC1Command, just before calling RunSafely().
If you look at a previous diff of D69825, in clang/lib/Driver/Job.cpp, L389 there is:
static bool CRCEnabled{}; if (!CRCEnabled) { llvm::CrashRecoveryContext::Enable(); CRCEnabled = true; }
Which I removed, because I find it a bit awkward to "enable" something that should be decided internally by CrashRecoveryContext (this Enable API sounds more like internal behaviour leaking outside of CrashRecoveryContext).
The purpose of Enable/Disable is either for debugging (when enabling LIBCLANG_DISABLE_CRASH_RECOVERY) or possibly bubbling up the crash, as described in D23662.
Also consider the scenario where we sequentially compile several TUs: clang-cl a.cpp b.cpp c.cpp. Only the first call to CC1Command::Execute() should call llvm::CrashRecoveryContext::Enable().
It is orthogonal, I can send a separate patch. Should we instead call llvm::CrashRecoveryContext::Enable() in D69825, when clang.exe starts?
- The main purpose of the patch is the new "EnableExceptionHandler" mode for CrashRecoveryContext. I find the name a bit confusing, because on Posix there's no exception handler, and on Win32 an exception handler is used to implement CrashRecoverContext::RunSafely, but that's a different one.
What the mode really does is print a stack trace, write a minidump (on windows), and run cleanups. Perhaps a better name for that would be "DumpStackAndCleanupOnFailure"? Maybe they could be separate flags even.
Will do.
- The mechanism for doing the stack dump and cleanups looks a bit scary to me. That's sys::InvokeExceptionHandler, a name which doesn't really make sense on Posix where it calls SignalHandler, and passes in the RetCode argument as the signal parameter. On Win32 it calls LLVMUnhandledExceptionFilter, and it does this inside the SEH filter function, although I think it will always return EXCEPTION_EXECUTE_HANDLER, so it could just be called inside the exception handler instead?
On Posix, the call to SignalHandler looks scary, because it does a bunch of signal stuff, and e.g. on S390 hosts it raises the signal again to make sure to crash (which I don't think we want inside a CrashRecoveryContext).
I wish the stack dumping and cleanup could be factored out into something that doesn't involve all the signal stuff, and that could have a clean interface that CrashRecoveryContext could simply call when it catches a crash. Maybe sys::DumpStackAndCleanup(...). Do you think this would be possible?
I'll make it possible :-)
This is really cool; we've wanted this for a long time.
One concern I have is that I think this will interfere (effectively disable) automatic OS handling of these crashes, which means they won't be collated and reported anymore. Can we make this configurable somehow? (E.g., leave behind an -ffork-cc1 -fno-fork-cc1 and a CMake flag to pick? Or just the CMake flag?)
(I also have a couple of minor nits inline.)
llvm/lib/Support/CrashRecoveryContext.cpp | ||
---|---|---|
114 | Can you move the nullptr default assignments back to the header, so it's still obvious how they are set up? | |
llvm/lib/Support/Unix/Signals.inc | ||
348 | I'm not sure I've seen signed in code before. Might it be simpler to just say int for the return type? Or is that somehow incorrect/confusing? |
(I'm not sure we'd want to expose this on the driver command-line. But making it just a CMake flag is awkward because then you can't run tests for both without configuring twice. Another option that would keep both paths tested is to have an environment variable such as CLANG_FORK_CC1={0,1}, with a CMake flag to pick the default behaviour.)
Sorry, you can ignore all of my high-level comments, I thought I was commenting on https://reviews.llvm.org/D69825.
Addressed comments, and:
- Removed CrashRecoveryContext::HandleCrash() and #pragma clang handle_crash which wasn't covered by any test and most likely deprecated by #pragma clang crash
- Added refcounting to CrashRecoveryContext::Enable/Disable as well as lazy installation of the exception handlers in CrashRecoveryContext.
- Partially reverted rG4624e83ce7b124545b55e45ba13f2d900ed65654/llvm/lib/Support/Unix/Signals.inc to simplify checks for IntSigs (and to fix the issue mentioned here). @vsk
All tests pass on Windows/MSVC2019/Clang9/10 and Linux/Clang9/GCC9.1.
NB: If desired, I could split this down further into smaller patches (before commit?), to simplify bisection if anything goes bad later on.
llvm/include/llvm/Support/CrashRecoveryContext.h | ||
---|---|---|
26 | I've reverted to the previous behavior, as described in the comment. |
clang/lib/Lex/Pragma.cpp | ||
---|---|---|
1108 ↗ | (On Diff #232854) | This was added in http://llvm.org/r111449 and looks like it was just a debugging aid. Perhaps you could just remove it in a separate change (that is, without waiting for the rest of this change). |
llvm/include/llvm/Support/CrashRecoveryContext.h | ||
105 | The exception terminology only really applies on Windows. Maybe "Selects whether handling of failures should be done in the same way as for regular crashes. When this is active, ...". | |
llvm/include/llvm/Support/Signals.h | ||
115 | What is ExceptContext and what does CleanupOnSignal do with it? | |
llvm/lib/Support/CrashRecoveryContext.cpp | ||
61 | It would be good to have a comment that explains the RetCode and ExceptContext parameters. | |
92 | Style nit: function names should start with lower-case letter. (The current code is not very consistent, but at least for new names we should try.) Also, could this just just be part of (un)installExceptionOrSignalHandlers()? It Seems like it's essentially just doing some extra checks around calls to those, and it seems unfortunate to have both "install()" and "installExceptionOrSignalHandlers()", but one should not be called directly etc. And on a higher level, it worries me that the set of possible states (enabled/disabled, installed/uninstalled) is so large. I guess this is because CrashRecoveryContexts can be nested and enabled individually? Could we use asserts to make it clearer what states should be possible? For example, in Install() you do "if (gCrashRecoveryEnabledCount <= 0) return". Why would we even hit this code if there are no enabled contexts? Etc. | |
151 | Don't we need to install it here? | |
199 | I guess the "inline" is not necessary? | |
llvm/lib/Support/Unix/Signals.inc | ||
212 | This seems unrelated? Could it be done in a separate patch? | |
358 | Isn't this redundant with the llvm::is_contained check above, after which we return? | |
llvm/unittests/Support/CrashRecoveryTest.cpp | ||
110 ↗ | (On Diff #232854) | Isn't this a pretty surprising API, that after calling Disable(), it's still enabled? |
I looked over this again, and also studied CrashRecoveryContext some more.
I don't really understand why this patch needs to modify the code for how the CRC is enabled and installed, etc.
I thought all we need for in-process-cc1 is to add the DumpStackAndCleanupOnFailure flag and behavior, nothing more.
Do you think this is possible, or am I missing something?
Hi Hans,
Sorry for taking time to respond.
The way Enable()/Disable() is currently implemented will not work when the tool executes jobs in parallel (ie. llvm-buildozer I presented at LLVM conference; or our re-implementation of /MP which I haven't published yet). It needs refcounting, otherwise one instance might disable the CRC while other threads are running, which effectively disables the crash handlers.
But we can discuss that in a separate review if you prefer. I'll remove it from this patch unless you say otherwise.
The way Enable()/Disable() is currently implemented will not work when the tool executes jobs in parallel (ie. llvm-buildozer I presented at LLVM conference; or our re-implementation of /MP which I haven't published yet). It needs refcounting, otherwise one instance might disable the CRC while other threads are running, which effectively disables the crash handlers.
But we can discuss that in a separate review if you prefer. I'll remove it from this patch unless you say otherwise.
Yes, I think it would be good to deal with that separately, to allow for incremental progress.
Changes as suggested by @hans.
llvm/include/llvm/Support/Signals.h | ||
---|---|---|
115 | Improved comment. | |
llvm/lib/Support/CrashRecoveryContext.cpp | ||
92 | Reverted this part. | |
151 | Reverted this part. | |
llvm/lib/Support/Unix/Signals.inc | ||
212 | Reverted this part. | |
358 | Above it is checking for "Info" Sigs, whereas here it is "Int" Sigs. | |
llvm/unittests/Support/CrashRecoveryTest.cpp | ||
110 ↗ | (On Diff #232854) | Reverted this part. |
Looks good to me. Thanks for pushing this forward!
(Only one very minor comment.)
llvm/unittests/Support/CrashRecoveryTest.cpp | ||
---|---|---|
72 ↗ | (On Diff #237038) | Is there a purpose to passing the pointer her, or could it just use incrementGlobal? If we can't, maybe call it WithPointer or something instead (or use an overload) - I was confused by the Cookie part until I realize you just mean it takes some argument. |
Thanks for taking the time @hans!
llvm/unittests/Support/CrashRecoveryTest.cpp | ||
---|---|---|
72 ↗ | (On Diff #237038) | The cookie thing is because: using SignalHandlerCallback = void (*)(void *); /// Add a function to be called when an abort/kill signal is delivered to the /// process. The handler can have a cookie passed to it to identify what /// instance of the handler it is. void AddSignalHandler(SignalHandlerCallback FnPtr, void *Cookie); Sure, I'll change it to an overload. |
llvm/lib/Support/CrashRecoveryContext.cpp | ||
---|---|---|
270 | This implicitly converts a pointer to uintptr_t, which is an error by default in mingw mode; a cast is needed. |
llvm/lib/Support/CrashRecoveryContext.cpp | ||
---|---|---|
270 |
I guess this needs an update, if it's now being enabled by default?