This is an archive of the discontinued LLVM Phabricator instance.

Use rw locks for sanitizer thread registry
AbandonedPublic

Authored by fjricci on Feb 6 2017, 9:15 AM.

Details

Summary

When StopTheWorld is used to checks for leaks, one thread acquires the
mutex on the thread registry in CheckForLeaks, but a newly spawned thread
accesses the registry in CheckForLeaksCallback. Sharing a mutex access
across threads like this is not supported by blocking mutex,
but is supported by read-only locks in rw mutexes.

In order to allow for read-only access by multiple threads, this patch
changes the lock on the thread registry to a rw mutex.

Event Timeline

fjricci created this revision.Feb 6 2017, 9:15 AM
kubamracek edited edge metadata.Feb 6 2017, 9:37 AM

Please factor out the renaming of GetTid to internal_gettid into a separate patch.

And also please factor out the validation of tid. These 3 things are independent separate fixes and should be separate patches:

  1. The extra validation in BlockingMutex and RWMutex.
  2. Renaming of GetTid.
  3. Changing the mutex of ThreadRegistry to be RWMutex.
fjricci planned changes to this revision.Feb 6 2017, 10:04 AM

Will split.

fjricci updated this revision to Diff 87264.Feb 6 2017, 10:48 AM
fjricci edited the summary of this revision. (Show Details)
fjricci removed a reviewer: kubamracek.

Splits out move to rw locks from related changes

Accidentally removed a reviewer, re-adding.

kcc edited edge metadata.Feb 6 2017, 10:53 AM

This is a very invasive change with a potential to break lots of things in subtle ways.
It has more chances to get reviewed and accepted if you

a) test it on something very big (and tell us what you did) and
b) add a test that fails now and passes with this change.

Also, please make sure to have LGTM from dvyukov before committing.

dvyukov edited edge metadata.Feb 7 2017, 2:01 AM

Sharing a mutex access across threads like this is not supported by blocking mutex

Why?
I've checked BlockingMutex comments and linux/mac/win implementation. I don't see any indication that it's not supported.
How is RWMutex different?
How does the problem manifest? What is the scenario that demonstrated the problem?

Spin mutex can behave very poorly on contention, and thread registry can be contended. I suspect blocking mutex is there on purpose.

Sharing a mutex access across threads like this is not supported by blocking mutex

Why?

If a thread owns a mutex on the thread registry for writing (as with a blocking mutex), we can't really assume in another thread that the registry won't change. Since the thread accessing the data is actually in a separate process (the result of a clone() in StopTheWorld), I'm not sure that there's even a way to pass the blocking mutex around safely.

I've checked BlockingMutex comments and linux/mac/win implementation. I don't see any indication that it's not supported.
How is RWMutex different?

An RWMutex acquired for read-only use can be safely used for reads by multiple threads, as it guarantees the protected data won't change. A blocking mutex provides no guarantees that the protected state won't change (and in fact implies that it probably will).

How does the problem manifest? What is the scenario that demonstrated the problem?

There are two pieces to this. The primary bit is just that adding checks to ensure that CheckLocked() only succeeds if the calling thread is the thread that owns the mutex will fail, as demonstrated by a cherry-pick of D29594. The other bit is consistency. Since these checks already exist on darwin and windows, the existing code will fail if and when we decide to add support for lsan to those platforms. This gives essentially two options - one would be removing the validation checks from the darwin and windows blocking mutex implementations, so that they succeed in the same way that linux currently does, by no longer forcing CheckLocked() to fail if another thread owns the mutex. The other would be the current method of adding strict validation checks to the linux implementation, which requires moving away from a blocking mutex.

Spin mutex can behave very poorly on contention, and thread registry can be contended. I suspect blocking mutex is there on purpose.

In terms of improving on spin mutex, I had originally used the pthreads rw lock implementation. However, this adds some libc dependencies which I wasn't sure we wanted.

Sharing a mutex access across threads like this is not supported by blocking mutex

Why?

If a thread owns a mutex on the thread registry for writing (as with a blocking mutex), we can't really assume in another thread that the registry won't change. Since the thread accessing the data is actually in a separate process (the result of a clone() in StopTheWorld), I'm not sure that there's even a way to pass the blocking mutex around safely.

I've checked BlockingMutex comments and linux/mac/win implementation. I don't see any indication that it's not supported.
How is RWMutex different?

An RWMutex acquired for read-only use can be safely used for reads by multiple threads, as it guarantees the protected data won't change. A blocking mutex provides no guarantees that the protected state won't change (and in fact implies that it probably will).

Sorry, I am not following you.
How can protected data change when a thread holds the mutex and is not changing the data? Who/how can change the data?
Whoever wants to change the data must acquire the mutex first, but it won't because the parent process holds the mutex.
What do you mean y "passing blocking mutex around"? And why do we need to do it?

The parent process holds the mutex thus ensuring that nobody else is changing the data.

Sharing a mutex access across threads like this is not supported by blocking mutex

Why?

If a thread owns a mutex on the thread registry for writing (as with a blocking mutex), we can't really assume in another thread that the registry won't change. Since the thread accessing the data is actually in a separate process (the result of a clone() in StopTheWorld), I'm not sure that there's even a way to pass the blocking mutex around safely.

I've checked BlockingMutex comments and linux/mac/win implementation. I don't see any indication that it's not supported.
How is RWMutex different?

An RWMutex acquired for read-only use can be safely used for reads by multiple threads, as it guarantees the protected data won't change. A blocking mutex provides no guarantees that the protected state won't change (and in fact implies that it probably will).

Sorry, I am not following you.
How can protected data change when a thread holds the mutex and is not changing the data? Who/how can change the data?
Whoever wants to change the data must acquire the mutex first, but it won't because the parent process holds the mutex.
What do you mean y "passing blocking mutex around"? And why do we need to do it?

The parent process holds the mutex thus ensuring that nobody else is changing the data.

Correct. We know that the data won't change in this particular case, because we know the parent isn't changing the data, and we know that our parent is holding the lock. However, this isn't what CheckLocked() is checking. CheckLocked() checks that *any* thread holds the lock, it doesn't need to be our parent to pass the check. That thread which is holding the lock could very well be changing the data. Even our parent could, in theory, be changing the data.

Correct. We know that the data won't change in this particular case, because we know the parent isn't changing the data, and we know that our parent is holding the lock. However, this isn't what CheckLocked() is checking. CheckLocked() checks that *any* thread holds the lock, it doesn't need to be our parent to pass the check.

I still don't get the problem.
How is CheckLocked related? I don't see any CheckLocked calls in lsan code.

That thread which is holding the lock could very well be changing the data. Even our parent could, in theory, be changing the data.

Um... well, but even the current thread can, in theory, hold the mutex and still wrongly change the data when it should not.
The parent is holding the mutex preventing all other threads from changing the data, and it should not change the data itself and it is not changing the data. Looks good to me. If parent will start changing data, we are in trouble. But there are zillion of possibilities of how we can end up in trouble if people will start randomly changing correct code to incorrect code.

Correct. We know that the data won't change in this particular case, because we know the parent isn't changing the data, and we know that our parent is holding the lock. However, this isn't what CheckLocked() is checking. CheckLocked() checks that *any* thread holds the lock, it doesn't need to be our parent to pass the check.

I still don't get the problem.
How is CheckLocked related? I don't see any CheckLocked calls in lsan code.

CheckLocked() is called via: __lsan::CheckForLeaksCallback() -> __lsan::GetThreadRangesLocked() -> __sanitizer::ThreadRegistry::FindThreadContextByOsIDLocked() -> CheckLocked()

I think I should probably do a better job of explaining the motivations behind these two changes.

BlockingMutex::CheckLocked() is part of an OS-independent API, called primarily from OS-independent code. To me, this means that its behavior should be the same regardless of platform. Currently this is not the case. If a platform-independent function calls CheckLocked(), its behavior will be different depending on the OS. So in practice, when FindThreadContextByOsIDLocked() calls CheckLocked(), if the code is running on linux, the check will pass as long as *any thread* holds the mutex. On mac or windows, the check will only pass if the *calling thread* holds the mutex.

Essentially, the problem I'm trying to solve is solidifying the expected behavior of CheckLocked(). If we think that the mac and windows implementations are representative of the correct API, then we should also only pass the check if the calling thread holds the mutex on linux, which requires these two changes. If we think the linux implementation is representative, then on mac and windows, we should pass as long as any thread holds the mutex, and not fail if the current thread isn't the holder. I'm happy with either API, although the current stricter approach seemed more correct to me, due to the reasons I've attempted to outline in previous comments.

It would probably be productive to decide what we want the general behavior of CheckLocked() to be. From there, we can decide whether we want to use this patch or a different fix.

The contract for CheckLocked is: CheckLocked does nothing if the current thread holds the mutex, otherwise behavior is undefined.
And, yes, we are currently abusing the sloppy check in BlockingMutex on linux in CheckForLeaks.

However, your solution does not look right to me. There are several problems with it:

  1. Besides ThreadRegistry mutex, CheckForLeaks also locks in a similar manner allocator mutexes and dl_iterate_phr mutex. They are no firing currently due to a number of coincidences (e.g. not calling CheckLocked where is should be called, using SpinMutex that has the same sloppy check).
  2. StopTheWorld is also used in tsan and subject to all the same problems. We can't just start changing all mutexes to RWMutex to please the stricter check. And finally:
  3. RWMutex has the same sloppy ownership check for readers as the one you are trying to eliminate in BlockingMutex. When/if we make the check in RWMutex strict (require the calling thread to hold read lock), it will all fall apart because you don't actually fix the problem, you just mask it again abusing the sloppy check in RWMutex.

The root problem here is that StopTheWorld does weird things with mutexes, clone and ptrace. But it does them for a reason. A proper solution to this problem would be: parent grabs all mutexes and starts the child process, child process stops all threads _besides_ the parent thread, child process instructs parent thread to unlock the mutexes, parent thread releases all mutexes and blocks on something, child process grabs all mutexes. But I don't think it's worth it.

I see 2 practical ways to resolve this:

  1. Make all CheckLocked checks sloppy. It's actually not that bad, because we will catch any misuse eventually. It's also simple and fast.
  2. Make all CheckLocked checks strict (including RWMutex and SpinMutex) but introduce a special global of per-thread flag for StopTheWorld that will say "back off, I am doing weird things here, so disable ownership checks for now".

Ok, I think that makes a lot of sense. I did originally think about something similar to the "proper solution" you mentioned, but I came to a similar conclusion about its complexity vs value.

In terms of the 2 solutions you proposed, the first seems quite a bit cleaner, and mirrors the behavior we already have on Linux. I'll go ahead and put up a diff for that, providing nobody has a strong preference for the second.

fjricci abandoned this revision.Feb 9 2017, 11:48 AM

Fixed via D29728