This is an archive of the discontinued LLVM Phabricator instance.

Fix a race condition in support library ThreadPool
ClosedPublic

Authored by jhen on Apr 5 2016, 5:51 PM.

Details

Summary

Running TSAN on the ThreadPool unit tests revealed that the threads in the
pool can pop tasks off the queue at the same time the "wait" routine is
checking whether the task queue is empty. This patch fixes the problem by
checking for active threads in the waiter before checking whether the queue
is empty.
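
A minimal sketch of the reordered check in wait(), using the names from the
patch (the surrounding locking is elided):

// ActiveThreads is atomic; checking it first means Tasks is only read once
// no worker thread can be in the middle of popping from it.
CompletionCondition.wait(LockGuard,
                         [&] { return !ActiveThreads && Tasks.empty(); });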

Diff Detail

Repository
rL LLVM

Event Timeline

jhen updated this revision to Diff 52754.Apr 5 2016, 5:51 PM
jhen retitled this revision from to Fix a race condition in support library ThreadPool By running TSAN on some code using the ThreadPool it was discovered that the threads in the pool can pop tasks off the queue at the same time the "wait" routine is trying to check if the task....
jhen updated this object.
jhen added a reviewer: jlebar.
jhen added a subscriber: llvm-commits.
jhen retitled this revision from Fix a race condition in support library ThreadPool By running TSAN on some code using the ThreadPool it was discovered that the threads in the pool can pop tasks off the queue at the same time the "wait" routine is trying to check if the task... to Fix a race condition in support library ThreadPool.Apr 5 2016, 5:55 PM
jhen updated this object.
jlebar edited edge metadata.Apr 5 2016, 6:20 PM

Apologies for scope creep here. If you want to save the bigger comments for another patch, that's fine. But I think most of these need to be addressed one way or another (not necessarily by you, I suppose).

The atomic ActiveThreads niggles me; it may be worth cc'ing the original author here to make sure it's not a Chesterton's Fence. I think this code is much more correct, though. :)

include/llvm/Support/ThreadPool.h
130 ↗(On Diff #52754)

Maybe worth adding comments here indicating which variables are guarded by the mutex, so we are less likely to have the bug we previously had? I'd also move the definition of Mutex and the condvars above whatever they protect.

It would also be nice to have a testcase which illustrates the bug (even if only under tsan), but I don't know how hard that would be.
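
Something along these lines, say (a sketch using the member names from the
patch, not the actual header):

/// Mutex protecting the task queue and the shutdown flag below.
std::mutex QueueLock;
std::condition_variable QueueCondition;
/// Tasks waiting for execution in the pool; guarded by QueueLock.
std::queue<PackagedTaskTy> Tasks;
/// When false, the pool is shutting down; guarded by QueueLock.
bool EnableFlag;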

lib/Support/ThreadPool.cpp
39 ↗(On Diff #52754)

Hmm, this means that if there are tasks in the queue when we run the destructor, we skip them. That is, we only block until we complete the currently-running tasks.

At least this should be documented in the destructor comment; it's not clear to me from that comment that this is our behavior. But I'm also not sure it's desirable, as it can leave us with futures that will never be completed. Specifically, if you have threads waiting on futures, and then you destroy the threadpool without calling wait() on the threadpool first, those waiting threads will never wake up.

Since the destructor has to wait for running threads to complete *anyway*, clearing out the whole task queue seems pretty reasonable to me.
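
For example (hypothetical caller code; longRunningTask and queuedTask are
placeholders):

{
  ThreadPool Pool(1);
  auto F1 = Pool.async(longRunningTask); // occupies the only worker
  auto F2 = Pool.async(queuedTask);      // still sitting in the queue
  // If ~ThreadPool dropped queued tasks here, any thread blocked in
  // F2.wait() would never wake up.
}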

47 ↗(On Diff #52754)

The order is no longer relevant, as everything is guarded by a single mutex. I think we can nix this whole comment?

60 ↗(On Diff #52754)

Not sure this comment is necessary; seems pretty clear what's going on.

88 ↗(On Diff #52754)

This is interesting, as it means that you're not allowed to enqueue new tasks from a task T if it's possible that T is still running when we destroy the threadpool. Seems like it's worth a comment at least. I think the right behavior probably is to explicitly disallow new tasks after we've begun shutting down -- maybe we return a null future? Can we do that?

Threadpool shutdown is always so complicated...

104 ↗(On Diff #52754)

May be worth #ifndef NDEBUG asserting here that ActiveThreads is 0.
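
That is, something like (a sketch; the assert message is illustrative):

#ifndef NDEBUG
// Every worker should have finished its task and decremented the counter
// before the destructor gets this far.
assert(ActiveThreads == 0 && "Destroying pool with active threads");
#endif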

jhen updated this revision to Diff 52848.Apr 6 2016, 1:58 PM
jhen marked 6 inline comments as done.
jhen edited edge metadata.
  • Respond to jlebar's comments
include/llvm/Support/ThreadPool.h
130 ↗(On Diff #52754)

Done for the comments and rearranging.

The current tests show the bug under TSAN, but I can't think of a way to get it to manifest reliably otherwise.

lib/Support/ThreadPool.cpp
39 ↗(On Diff #52754)

It looks to me like it's not actually leaving any tasks in the queue. The statement below is the only way the thread can terminate from within, and it checks for Tasks.empty() before returning. If there are still tasks in the queue, the thread will grab one and run it regardless of whether the EnableFlag is set.
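
Roughly, the exit check in question is (paraphrasing the patch; the
surrounding locking is elided):

QueueCondition.wait(LockGuard,
                    [&] { return !EnableFlag || !Tasks.empty(); });
// Terminate only when shutting down *and* the queue is drained, so tasks
// that were already queued still run after the destructor has started.
if (!EnableFlag && Tasks.empty())
  return;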

88 ↗(On Diff #52754)

I think this is OK because the only way to destroy the ThreadPool is to call its destructor. If anyone calls the enqueue method after the destructor has been called, I would say that is a bug in their program, not in this ThreadPool, and I think the assert failure is appropriate. If a method is one day added to destroy the pool without calling its destructor, I could see this being a problem, so I changed the name of EnableFlag to InDestructorFlag to make anybody think twice about setting it outside the destructor.
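
The guard itself would then read roughly (a sketch with the renamed flag;
the assert message is illustrative):

// In asyncImpl, under QueueLock: enqueueing after the destructor has
// started is a bug in the caller.
assert(!InDestructorFlag &&
       "queuing a task while the ThreadPool is being destroyed");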

mehdi_amini edited edge metadata.Apr 6 2016, 2:47 PM

One lock is simpler to understand; however, it adds unnecessary contention, which is what I was trying to avoid originally.

include/llvm/Support/ThreadPool.h
130 ↗(On Diff #52848)

Is this related to the changes here? If not, split it out into a separate NFC patch, reducing this patch to the bug it is actually fixing. Same for renaming QueueCondition to SubmissionCondition.

lib/Support/ThreadPool.cpp
79 ↗(On Diff #52848)

Would just reversing the check solve the race?

[&] { return !ActiveThreads && Tasks.empty(); });

If not, can you elaborate a little?

83–84 ↗(On Diff #52848)

I agree with jlebar that this scenario is sketchy:

{ 
  ThreadPool Pool;
  Pool.async([&] () {
     if (random()) Pool.async(doSomething);
  });
}

Technically it means that the client has to insert an explicit wait if a task can enqueue another task:

{ 
  ThreadPool Pool;
  Pool.async([&] () {
     if (random()) Pool.async(doSomething);
  });
  Pool.wait();
}

Synchronizing on a future inside a task would have the same kind of issue as well.

mehdi_amini added inline comments.Apr 6 2016, 2:52 PM
lib/Support/ThreadPool.cpp
65 ↗(On Diff #52848)

To answer the comment about why ActiveThreads was atomic: it is updated here guarded by CompletionLock, but was incremented a few lines above guarded by QueueLock.
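
Concretely, the two update sites look roughly like this (a sketch; the rest
of the worker loop is elided):

{
  std::unique_lock<std::mutex> LockGuard(QueueLock);
  ++ActiveThreads; // incremented while holding QueueLock
  // ... pop the task from Tasks ...
}
// ... run the task ...
{
  std::unique_lock<std::mutex> LockGuard(CompletionLock);
  --ActiveThreads; // decremented while holding CompletionLock
}

Since no single mutex covers both updates, the counter itself has to be
atomic.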

jlebar added inline comments.Apr 6 2016, 2:53 PM
lib/Support/ThreadPool.cpp
79 ↗(On Diff #52848)

The race is that we touch Tasks here while holding only CompletionLock, while another thread in asyncImpl may touch Tasks while holding only QueueLock.
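
That is, the two racing accesses are roughly (paraphrasing the diff):

// In wait(), holding only CompletionLock: reads Tasks.
CompletionCondition.wait(LockGuard,
                         [&] { return Tasks.empty() && !ActiveThreads; });

// In asyncImpl (push) or a worker thread (pop), holding only QueueLock:
Tasks.pop(); // or Tasks.push(std::move(PackagedTask));

std::queue is not thread-safe, so touching it under two different mutexes
is a data race.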

mehdi_amini added inline comments.Apr 6 2016, 2:56 PM
lib/Support/ThreadPool.cpp
79 ↗(On Diff #52848)

Mmmm, the description of the revision was not about *enqueuing* as you suggest, but about popping.
I need another look, then.

mehdi_amini added inline comments.Apr 6 2016, 3:02 PM
lib/Support/ThreadPool.cpp
79 ↗(On Diff #52848)

The use case you're mentioning, i.e. waiting for the pool to finish in one thread while another thread enqueues to the pool, is explicitly forbidden by the API; see the wait() documentation: "It is an error to try to add new tasks while blocking on this call."

jlebar added a comment.Apr 6 2016, 3:09 PM

One lock is simpler to understand; however, it adds unnecessary contention, which is what I was trying to avoid originally.

My preference is that we write simple, correct code and worry about performance separately, when we have explicit use-cases, benchmarks, etc. Otherwise we're just prematurely optimizing. Does that sound OK to you?

include/llvm/Support/ThreadPool.h
130 ↗(On Diff #52848)

This feels pretty unnecessary to me. We're already changing a bunch of things in this class; who cares if we rename one additional variable in this patch as opposed to in a separate one? All of these changes are related in that they're attempting to make it less likely that this class will have bugs in the future.

There's a cost to carrying small patches in your queue that are tightly intermingled with a base patch; every time you update the base patch, you have to rebase the dependent patch.

I like small patches too, but there is a line. :)

lib/Support/ThreadPool.cpp
39 ↗(On Diff #52848)

Oh, indeed, you're right; it's &&, not ||. Carry on. :)

jhen added a comment.Apr 6 2016, 3:25 PM

To elaborate on the race condition, TSAN warns about the following when running the AsyncBarrier unit test (irrelevant stack frames removed):

WARNING: ThreadSanitizer: data race (pid=690669)
  Read of size 8 at 0x7fff3627b2b8 by main thread (mutexes: write M1928):
    wait<(lambda at lib/Support/ThreadPool.cpp:79:28)>
  Previous write of size 8 at 0x7fff3627b2b8 by thread T5 (mutexes: write M1849):
    operator() lib/Support/ThreadPool.cpp:53

So basically the wait routine can be checking whether the task queue is empty at the same time a thread is trying to remove an element from it. The queue data structure is not thread-safe, so this is a race.

Sorry for the misleading description. I hope this clears up the issue.

mehdi_amini added a comment.

My preference is that we write simple, correct code and worry about performance separately, when we have explicit use-cases, benchmarks, etc. Otherwise we're just prematurely optimizing. Does that sound OK to you?

Well, one could use exactly the same sentence to argue against threading at all in the first place :)
This implementation is pretty dumb anyway and I don't expect it to scale; it is convenient for coarse-grained tasks, so I am indeed not really worried about the contention here.
There is always a tradeoff to be acknowledged; I wanted to make that explicit here because I felt it wasn't clear from the original description
(i.e., threads will now take the lock and release it both when popping from the queue and when the task is completed).

include/llvm/Support/ThreadPool.h
130 ↗(On Diff #52848)

This does not seem unnecessary to me, and it is in line with the requests I usually receive on my own patches. git blame and git show are a lot friendlier for analyzing why the code is the way it is. As we say, written once, read many...
(Also, minimizing the diff helps the review.)

As for the cost you're mentioning, you don't have to carry small patches in your queue: you can commit the NFC/renaming change immediately without going through any review.
It seems totally "free" to me, so I don't really see why we wouldn't do it.

jhen: can you try reversing the condition as I suggested and verify whether TSAN still complains? (I'm still trying to make sure I have a good understanding of the existing logic...)

jhen added a comment.Apr 6 2016, 3:29 PM

To clarify further: the race can occur without submitting tasks while waiting; there just has to be a task in the queue at the time wait is called. That task would have been enqueued before the call to wait.

jlebar added inline comments.Apr 6 2016, 3:34 PM
include/llvm/Support/ThreadPool.h
130 ↗(On Diff #52848)

As we say, written once, read many...

Surely that does not apply to blame anywhere close to the degree it applies to code itself.

Since code is read so much more often than blame, we should encourage and welcome cleanups, making them easy for the contributor. As we do, with post-commit review.

In this case, Jason is already carrying the rename in his tree. You're asking him to either

a) split it out and commit the rename after he's submitted this patch, in which case he has to carry the changes through review and rebase, as I described, or

b) commit the changes now, before this patch is ready, in which case they may not end up being necessary, depending on the direction this review takes. (Who knows, maybe we remove this variable.) Committing unnecessary changes adds noise to the mailing list and wastes everyone's time.

In either case it's not free.

jhen added a comment.Apr 6 2016, 3:44 PM

joker.eph: You are right, the TSAN error goes away if the order of the conditions on line 79 is swapped. I'll update this patch to make only that change.

jhen updated this revision to Diff 52862.Apr 6 2016, 3:53 PM
jhen edited edge metadata.
  • Simplify the change to one line
jhen added a comment.Apr 6 2016, 3:53 PM

OK, I've simplified the change to just the single line suggested by joker.eph and the race condition is now cleared up.

mehdi_amini accepted this revision.Apr 6 2016, 3:56 PM
mehdi_amini edited edge metadata.

I'm fine with this revision, of course.

However, I think the original work you did is still valuable, since it makes the code much simpler.
If you'd like, I would not oppose a separate clean-up "NFC" patch that acknowledges the tradeoff on locking in favor of more understandable and maintainable code.

This revision is now accepted and ready to land.Apr 6 2016, 3:56 PM
jhen added a comment.Apr 6 2016, 4:03 PM

Thanks for your help jlebar and joker.eph.

jhen updated this object.Apr 6 2016, 4:46 PM
jhen edited edge metadata.
This revision was automatically updated to reflect the committed changes.