This is an archive of the discontinued LLVM Phabricator instance.

Fix memory leak in CreateSigAltStack
Needs ReviewPublic

Authored by fez on Mar 5 2019, 4:00 PM.

Details

Reviewers
jfb
Summary

In lib/Support/Unix/Signals.inc there is an occasional memory leak, whenever
the function CreateSigAltStack is called multiple times in certain conditions.

To reproduce the bug, just run

valgrind --leak-check=full opt -o /dev/null /dev/null

OS: Ubuntu 18.04
Valgrind version: 3.13.0
llvm version: 7.0.0 (opt-7 system installation in ubuntu, but for what I saw in the sources it affects llvm 8.0.0 as well)
You should be on a Unix system with sigaltstack available.

This is the leak report I get from Valgrind

$ valgrind --leak-check=full opt-7 -o /dev/null /dev/null
==27771== Memcheck, a memory error detector
==27771== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==27771== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==27771== Command: opt-7 -o /dev/null /dev/null
==27771==
==27771==
==27771== HEAP SUMMARY:
==27771==     in use at exit: 67,640 bytes in 3 blocks
==27771==   total heap usage: 3,488 allocs, 3,485 frees, 1,069,258 bytes
allocated
==27771==
==27771== 67,584 bytes in 1 blocks are definitely lost in loss record 3 of 3
==27771==    at 0x4C2FB0F: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==27771==    by 0x59622D3: ??? (in /usr/lib/x86_64-linux-gnu/libLLVM-7.so.1)
==27771==    by 0x5909CED: llvm::EnablePrettyStackTrace() (in
/usr/lib/x86_64-linux-gnu/libLLVM-7.so.1)
==27771==    by 0x58F2C21: llvm::InitLLVM::InitLLVM(int&, char const**&) (in
/usr/lib/x86_64-linux-gnu/libLLVM-7.so.1)
==27771==    by 0x1B43F7: main (in /usr/lib/llvm-7/bin/opt)
==27771==
==27771== LEAK SUMMARY:
==27771==    definitely lost: 67,584 bytes in 1 blocks
==27771==    indirectly lost: 0 bytes in 0 blocks
==27771==      possibly lost: 0 bytes in 0 blocks
==27771==    still reachable: 56 bytes in 2 blocks
==27771==         suppressed: 0 bytes in 0 blocks
==27771== Reachable blocks (those to which a pointer was found) are not shown.
==27771== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==27771==
==27771== For counts of detected and suppressed errors, rerun with: -v
==27771== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

I actually tracked it down in a version compiled from source with all the debug
symbols. The leak report is very similar, except that I get the exact info on
the leaking allocation:

==20704== HEAP SUMMARY:
==20704==     in use at exit: 1,020,675 bytes in 35 blocks
==20704==   total heap usage: 292,536 allocs, 292,501 frees, 49,368,073 bytes
allocated
==20704==
==20704== 67,584 bytes in 1 blocks are definitely lost in loss record 14 of 27
==20704==    at 0x4C2FB0F: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==20704==    by 0x893D63C: safe_malloc (MemAlloc.h:27)
==20704==    by 0x893D63C: CreateSigAltStack (Signals.inc:254)
==20704==    by 0x893D63C: RegisterHandlers() (Signals.inc:277)
==20704==    by 0x893DF9C: llvm::sys::AddSignalHandler(void (*)(void*), void*)
(Signals.inc:386)
==20704==    by 0x88E538D: RegisterCrashPrinter (PrettyStackTrace.cpp:180)
==20704==    by 0x88E538D: llvm::EnablePrettyStackTrace()
(PrettyStackTrace.cpp:188)
==20704==    by 0x88CFFED: PrettyStackTraceProgram (PrettyStackTrace.h:77)
==20704==    by 0x88CFFED: llvm::InitLLVM::InitLLVM(int&, char const**&)
(InitLLVM.cpp:25)
==20704==    by 0x41DF07: InitLLVM (InitLLVM.h:35)
==20704==    by 0x41DF07: main (opt.cpp:426)

The leak shows up whenever CreateSigAltStack is called multiple times, in some
specific conditions.

The first time CreateSigAltStack is called, a new alternate signal stack
isallocated with

AltStack.ss_sp = static_cast<char *>(safe_malloc(AltStackSize));

Then the value of AltStack.ss_sp is saved into NewAltStackPointer. According
to the comments, this is a workaround to stop valgrind or asan complaining about
a leak, but it's not enough. Here's why. Whenever CreateSigAltStack is called
a second time and it reaches past the first if, a new larger alternate stack is
allocated again with

AltStack.ss_sp = static_cast<char *>(safe_malloc(AltStackSize));

But at this point, the NewAltStackPointer is overwritten, and the old
alternate signal stack is never free'd, and we don't hold a pointer to it
anymore, causing the leak.

With this fix the leak should go away. The valgrind leak report report goes away
as well.

Event Timeline

fez created this revision.Mar 5 2019, 4:00 PM
Herald added a project: Restricted Project. · View Herald Transcript
fez added inline comments.Mar 5 2019, 4:08 PM
llvm/lib/Support/Unix/Signals.inc
255

The leak shows up here, whenever the execution reaches this point two times in a single execution. The first time is fine, but the second time the first alternate signal stack leaks, because free is not called before reassigning NewAltStackPointer.

jfb added a comment.Mar 6 2019, 12:55 PM

This seems wrong. How is this point being reached twice? It's called from here:

static void RegisterHandlers() { // Not signal-safe.
  // The mutex prevents other threads from registering handlers while we're
  // doing it. We also have to protect the handlers and their count because
  // a signal handler could fire while we're registeting handlers.
  static ManagedStatic<sys::SmartMutex<true>> SignalHandlerRegistrationMutex;
  sys::SmartScopedLock<true> Guard(*SignalHandlerRegistrationMutex);

  // If the handlers are already registered, we're done.
  if (NumRegisteredSignals.load() != 0)
    return;

  // Create an alternate stack for signal handling. This is necessary for us to
  // be able to reliably handle signals due to stack overflow.
  CreateSigAltStack();

Is the mutex wrong? It should ensure that the alt stack is only ever set up once.

fez added a comment.Mar 6 2019, 1:10 PM

Is the mutex wrong? It should ensure that the alt stack is only ever set up once.

The mutex is fine, but it only guarantees that only one thread at any given time enters CreateSigAltStack. Nothing prevents the entire RegisterHandlers function to be called multiple times, by a single thread, which causes the execution to reach line 255 multiple times in a single execution of the same thread. This is actually exactly what's happening in my debug runs. It's not a concurrency issue (which would be effectively prevented by the mutex), it's just a memory leak in a single thread.

fez marked an inline comment as done.EditedMar 6 2019, 1:29 PM

But now I'm puzzled by this check.

// If the handlers are already registered, we're done.
if (NumRegisteredSignals.load() != 0)
  return;

This should prevent the call to CreateSigAltStack to be reached the second time.

I need to investigate further.

jfb added a comment.Mar 6 2019, 2:11 PM
In D59005#1420436, @fez wrote:

But now I'm puzzled by this check.

// If the handlers are already registered, we're done.
if (NumRegisteredSignals.load() != 0)
  return;

This should prevent the call to CreateSigAltStack to be reached the second time.

I need to investigate further.

Indeed, that's why I was confused :)

jfb added a comment.Mar 6 2019, 2:15 PM

And NumRegisteredSignals uses sequentially-consistent operations on purpose: it's technically slower but way easier to understand. What could be happening is:

// The signal handler that runs.
static RETSIGTYPE SignalHandler(int Sig) {
  // Restore the signal behavior to default, so that the program actually
  // crashes when we return and the signal reissues.  This also ensures that if
  // we crash in our signal handler that the program will terminate immediately
  // instead of recursing in the signal handler.
  UnregisterHandlers();

if you deliver another signal after unregistering all of them, you get the behavior you observed? I'm not sure whether we care about clang-as-a-library being loaded / unloaded in the same program. If not, we could limit signal registration to be a one-time thing.

fez added a comment.Mar 6 2019, 2:49 PM

if you deliver another signal after unregistering all of them, you get the behavior you observed? I'm not sure whether we care about clang-as-a-library being loaded / unloaded in the same program. If not, we could limit signal registration to be a one-time thing.

That's right. That is exactly what happened when I first stumbled on the problem.

I tried to simplify it to a minimal reproducible example with Valgrind. Here's how I realized that even running valgrind --leak-check=full opt -o /dev/null /dev/null reported the leak, whereas with my fix the error report was going away (BTW have you had the chance to try it? It runs in just a few seconds even with Valgrind).

I think there are two things to point out about this situation.

  • With the suggested fix the error report from Valgrind goes away _and_ also the memory leak with register/unregister goes away, which means that yes, you could prevent a leak if someone has clang-as-a-library loaded/unloaded in the same program (even if I admit that is not something many people are going to do, even in my case I ended up doing something else).
  • Valgrind is reporting a false positive on the memory leak with valgrind --leak-check=full opt -o /dev/null /dev/null. I just double-checked, with that command line RegisterHandlers is being called only once, which means that there isn't actually any memory leak. Anyway, the fix makes the false positive effectively go away, which I think is a nice thing because in this way you get clean runs with Valgrind.

Both things might not be all that important, but the patch is pretty self-contained and I don't see any downsides.

What do you think?

jfb added inline comments.Mar 6 2019, 2:59 PM
llvm/lib/Support/Unix/Signals.inc
239

This comment leads me to believe that the leak is quite intentional. What happens if you free the alt stack while it's being used by another signal handler?

255

IIUC you need to trigger two signals? How does this happen? What are the signals that trigger the leak? Do they happen concurrently?

Maybe the answer is to hold on to the alt stacks, and free them only when there's no pending signal (or not free them, ever). That seems error-prone, and the point of my refactoring was to reduce the number of existing errors in the signal handler code (which was pretty high before).

Before we do something I think we need to understand exactly what triggers this.

fez marked an inline comment as done.Mar 6 2019, 4:18 PM
fez added inline comments.
llvm/lib/Support/Unix/Signals.inc
239

Good point.

For this to happen we need to assume we're are in a multi-threaded scenario, with more than one thread trying to register signal handlers and alt stacks for them.

If we assume such a scenario, then at the moment there actually is a memory leak:

  • thread 0 calls CreateSigAltStack, and reserves a new alt stack, storing a pointer to it in the static void *NewAltStackPointer.
  • later (never at the same time because of the mutex) thread 1 calls CreateSigAltStack, and reserves another new separate alt stack, storing a pointer to it again in NewAltStackPointer, which is static, hence shared among all threads. This shared static global gets overwritten by thread 1, effectively leaking the alt stack of thread 0.

The easiest fix I can think of, is to change OldAltStack and NewAltStackPointer to be thread_local static instead of just static. In this way we avoid both the leak (because thread 1 does not overwrite the NewAltStackPointer version of thread 0) and the race condition caused by one thread to free the alt stack of another thread (because each thread only allocates and frees its own thread_local alt stack, pointed-to by the thread_local version of NewAltStackPointer).

With this change, we could even think of moving the lock in RegisterHandlers right before

for (auto S : IntSigs)
  registerHandler(S);

This should be fine, because with CreateSigAltStack only manipulating local and thread_local variables there should be no race condition anymore in CreateSigAltStack.

fez marked an inline comment as done.Mar 6 2019, 4:38 PM
fez added inline comments.
llvm/lib/Support/Unix/Signals.inc
255

IIUC you need to trigger two signals? How does this happen? What are the signals that trigger the leak? Do they happen concurrently?

No, the leak is not triggered by incoming signals, but by the fact that CreateSigAltStack is called multiple times. But because of the check if (NumRegisteredSignals.load() != 0) this can never actually happen in a multi threaded program (and neither in a single threaded one), unless for some reasons the signal handlers are initially registered (causing NumRegisteredSignals to be != 0), then unregistered (causing NumRegisteredSignals to drop to 0), and then re-registered again (because after the de-registration, the check on NumRegisteredSignals.load != 0 fails a second time).

I realize this is an extremely unlikely scenario.

At this point the only immediate advantage I see is just cleaning up Valgrind's report from this subtle false positive, which I still think is a nice thing.

I think there are three options:

  • drop the patch, which leaves Valgrind complaining and effectively leaks in case of register/unregister/re-register;
  • accept it like it is now, which breaks multi-threaded programs if a thread frees another's alt stack like you pointed out;
  • consider the implications of my proposal above about thread_local, which should both fix Valgrind's false positive memory leak report, leave the multi-threaded scenario safe, and also support the register/unregister/re-register scenario (both for single- and multi-threaded).

What do you think?

fez marked 2 inline comments as not done.Mar 6 2019, 4:38 PM
jfb added inline comments.Mar 7 2019, 9:10 AM
llvm/lib/Support/Unix/Signals.inc
325

I think this is the root of your issue: the signal handlers are de-registered with the expectation that LLVM will never handle signals ever again. Clearly that's not true. What signals are being handled? i.e. to get the bug you're seeing, can you log Sig every time it occurs. What are they? Maybe we're installing non-fatal handlers (that's fine!) but then we can only unregister handlers for fatal signals, not for non-fatal ones.