This is an archive of the discontinued LLVM Phabricator instance.

Signal handling should be signal-safe
ClosedPublic

Authored by jfb on May 14 2018, 5:06 PM.

Details

Summary

Before this patch, signal handling wasn't signal safe. This leads to real-world
crashes. It used ManagedStatic inside of signals, this can allocate and can lead
to unexpected state when a signal occurs during llvm_shutdown (because
llvm_shutdown destroys the ManagedStatic). It also used cl::opt without custom
backing storage. Some de-allocation was performed as well. Acquiring a lock in a
signal handler is also a great way to deadlock.

We can't just disable signals on llvm_shutdown because the signals might do
useful work during that shutdown. We also can't just disable llvm_shutdown for
programs (instead of library uses of clang) because we'd have to then mark the
pointers as not leaked and make sure all the ManagedStatic uses are OK to leak
and remain so.

Move all of the code to lock-free datastructures instead, and avoid having any
of them in an inconsistent state. I'm not trying to be fancy, I'm not using any
explicit memory order because this code isn't hot. The only purpose of the
atomics is to guarantee that a signal firing on the same or a different thread
doesn't see an inconsistent state and crash. In some cases we might miss some
state (for example, we might fail to delete a temporary file), but that's fine.

Note that I haven't touched any of the backtrace support despite it not
technically being totally signal-safe. When that code is called we know
something bad is up and we don't expect to continue execution, so calling
something that e.g. sets errno is the least of our problems.

A similar patch should be applied to lib/Support/Windows/Signals.inc, but that
can be done separately.

rdar://problem/28010281

Diff Detail

Repository
rL LLVM

Event Timeline

jfb created this revision.May 14 2018, 5:06 PM
dexonsmith requested changes to this revision.May 14 2018, 6:10 PM

I wonder if there's some way to write a test for this. E.g., a unit test that registers a signal handler for SIGUSR1 and then does what used to be necessary to trigger a crash.

lib/Support/Unix/Signals.inc
90 ↗(On Diff #146722)

You've said that removing is not signal-safe, but removal itself is; that's not computing for me. It looks like insert and unregister are not signal-save, but removeAll is signal-safe -- should you use that terminology here?

109 ↗(On Diff #146722)

I think current style guidelines for functions are lowerCamelCase, so this should be "insert".

121–122 ↗(On Diff #146722)

"unregister" instead "Unregister"

133 ↗(On Diff #146722)

"removeAll" instead of "RemoveAll"

166 ↗(On Diff #146722)

Current style doesn't have us repeating the name of what's being documented.

If you're just matching the style in the file you can do a prep NFC commit that fixes the style elsewhere, but either way let's avoid the redundancy in the new code.

256 ↗(On Diff #146722)

Why did you lambda-fy this? (If it's unrelated to the patch, can you split it out into a separate NFC commit?)

Also, I would prefer spelling this as registerHandler (lower camel case) since it's callable.

This revision now requires changes to proceed.May 14 2018, 6:10 PM

I think it is easier to unregister all the signal handlers in the beginning of the llvm_shutdown() since you should be done with all the work when you call llvm_shutdown(). This can also allow safe access ManagedStatics in signal handler, which I don't know if we have more of those other than the lock you are trying to fix. The only difference might be that if an interrupt is triggered when freeing ManagedStatics, llvm signal handler will not be used.

jfb added a comment.May 14 2018, 8:58 PM

(Will address code changes separately)

I wonder if there's some way to write a test for this. E.g., a unit test that registers a signal handler for SIGUSR1 and then does what used to be necessary to trigger a crash.

Hmm I don't think so, at least not a reliable test: most of the time the test wouldn't hit a dangerous interleaving. We could try to make the bad concurrent interleaving more likely, or loop a bunch so it's more likely to hit the race because it tries more, but I think fundamentally it would be a flaky "success" test, and mostly a test that does nothing. Were the code to start being broken, it would become a flaky failure.

I think it is easier to unregister all the signal handlers in the beginning of the llvm_shutdown() since you should be done with all the work when you call llvm_shutdown().

We might want to do this, but I'm worried that llvm_shutdown (and the ManagedStatic dtors it calls) could be used for more than just memory reclamation. If that's the case then it could crash, and then that wouldn't call signal handlers for e.g. stack traces on platforms that wanted it. If it starts taking a while then the CTRL-C handler wouldn't run (which might be fine... Presumably the tmp files are gone?).

I'm OK doing this, but I want to point out potential caveats and see what others think. It would definitely simplify some of the issues.

This can also allow safe access ManagedStatics in signal handler, which I don't know if we have more of those other than the lock you are trying to fix. The only difference might be that if an interrupt is triggered when freeing ManagedStatics, llvm signal handler will not be used.

Using ManagedStatic from a signal handler is fraught with peril:

  • It'll call new on first use. We can guard on isConstructed(), but that's problematic if llvm_shutdown is running (assuming we keep signals on when llvm_shutdown runs). We can force ourselves to always call new when the handler is first installed, but that's a bit fragile (but totally OK).
  • We'll then get a mutex, and we can't grab that mutex from the signal handler, because it the thread handling the signal held the mutex we get a (single-threaded!) deadlock.
  • We could spin a dedicated signal handler thread, but that seems like overkill.
  • We could spin a dedicated "do signal unsafe work" thread and post the work to it from the handler, but that's also overkill and would also require lock-free magic (plus waiting for the thread to finish its work before exiting).

It seems pretty simple to have lock-free stuff instead. ManagedStatic gives the feeling that everything is OK, just use a std::vector and all, but it's really not! Imagine that vector was in the process of growing and moving its content when the signal handler kicks in... You'd observe invalid state for sure. Lock-free code, while tricky and slightly ugly, makes it pretty obvious that everything is not OK, and you can't just use std::vector (or even std::atomic<std::pair<int, int>> because that's not even a thing).

In D46858#1098865, @jfb wrote:

(Will address code changes separately)

I wonder if there's some way to write a test for this. E.g., a unit test that registers a signal handler for SIGUSR1 and then does what used to be necessary to trigger a crash.

Hmm I don't think so, at least not a reliable test: most of the time the test wouldn't hit a dangerous interleaving. We could try to make the bad concurrent interleaving more likely, or loop a bunch so it's more likely to hit the race because it tries more, but I think fundamentally it would be a flaky "success" test, and mostly a test that does nothing. Were the code to start being broken, it would become a flaky failure.

Could you manufacture a deterministic (and dangerous) interleaving, by having the signal handler raise a signal on itself partway through something?

jfb added a comment.May 14 2018, 9:29 PM

I committed the two requests NFC changes. Will address other changes tomorrow.

In D46858#1098865, @jfb wrote:

(Will address code changes separately)

I wonder if there's some way to write a test for this. E.g., a unit test that registers a signal handler for SIGUSR1 and then does what used to be necessary to trigger a crash.

Hmm I don't think so, at least not a reliable test: most of the time the test wouldn't hit a dangerous interleaving. We could try to make the bad concurrent interleaving more likely, or loop a bunch so it's more likely to hit the race because it tries more, but I think fundamentally it would be a flaky "success" test, and mostly a test that does nothing. Were the code to start being broken, it would become a flaky failure.

Could you manufacture a deterministic (and dangerous) interleaving, by having the signal handler raise a signal on itself partway through something?

I don't think that'll do what we want: we want regular code to raise at a dangerous spot. We *can* raise from the signal handler, but that's not the race that was causing us trouble.

In D46858#1098875, @jfb wrote:

I committed the two requests NFC changes. Will address other changes tomorrow.

SGTM.

In D46858#1098865, @jfb wrote:

(Will address code changes separately)

I wonder if there's some way to write a test for this. E.g., a unit test that registers a signal handler for SIGUSR1 and then does what used to be necessary to trigger a crash.

Hmm I don't think so, at least not a reliable test: most of the time the test wouldn't hit a dangerous interleaving. We could try to make the bad concurrent interleaving more likely, or loop a bunch so it's more likely to hit the race because it tries more, but I think fundamentally it would be a flaky "success" test, and mostly a test that does nothing. Were the code to start being broken, it would become a flaky failure.

Could you manufacture a deterministic (and dangerous) interleaving, by having the signal handler raise a signal on itself partway through something?

I don't think that'll do what we want: we want regular code to raise at a dangerous spot. We *can* raise from the signal handler, but that's not the race that was causing us trouble.

I wondered if that would be a similar enough race that it would require the same sort of careful work, and thus could protect against "simplifying" the code later. But if that won't do it I suppose we'll just have to trust people to read comments.

dexonsmith accepted this revision.May 14 2018, 9:36 PM

Since you've answered my question about testing, this LGTM after you address the other feedback. Thanks for splitting out the two commits.

This revision is now accepted and ready to land.May 14 2018, 9:36 PM
jfb updated this revision to Diff 146744.May 14 2018, 10:16 PM
jfb marked 6 inline comments as done.
  • Rebase, address comments.
jfb requested review of this revision.May 14 2018, 10:18 PM

Actually I had a bit of time and addressed all outstanding comments from @dexonsmith.

On de-registering signal handlers at llvm_shutdown time: how about I post a separate patch for it? Seems like it could be good, but I want to have a focused patch to do so where we can have other chime in on the effect it could have.WDYT @steven_wu?

Leaving this open for a bit in case other want to chime in.

jfb updated this revision to Diff 146889.May 15 2018, 11:27 AM
  • Misc updates, cleanup
This revision was not accepted when it landed; it landed in state Needs Review.May 15 2018, 9:33 PM
This revision was automatically updated to reflect the committed changes.
jfb reopened this revision.May 15 2018, 9:45 PM

Had to revert because some bots don't have double-pointer-wide atomics.

jfb updated this revision to Diff 147114.May 16 2018, 9:30 AM
  • Fix double-wide CAS
lib/Support/Signals.cpp
53 ↗(On Diff #147114)

This is the update, avoids double-wide CAS by just using this flag.

dexonsmith accepted this revision.May 16 2018, 10:01 AM

I found a couple more nits to pick while reading the new version of the lock-free code. LGTM once you fix those.

lib/Support/Signals.cpp
64 ↗(On Diff #147114)

Reverse condition and early continue to reduce nesting?

80 ↗(On Diff #147114)

Reverse condition and early continue to reduce nesting?

87 ↗(On Diff #147114)

The optimizer is free to remove this loop exit. I think report_fatal_error would be more appropriate in this case.

lib/Support/Unix/Signals.inc
133–134 ↗(On Diff #147114)

Early continues to reduce nesting?

This revision is now accepted and ready to land.May 16 2018, 10:01 AM
jfb updated this revision to Diff 147126.May 16 2018, 10:27 AM
jfb marked 5 inline comments as done.
  • Address comments
jfb added a comment.May 16 2018, 10:28 AM

Comments addressed.

jfb accepted this revision.May 16 2018, 10:28 AM
jfb closed this revision.

Update landed in r332496.