This is an archive of the discontinued LLVM Phabricator instance.

[Support] Simplify and optimize ThreadPool
ClosedPublic

Authored by MaskRay on Apr 24 2020, 10:07 PM.

Details

Summary
  • Merge QueueLock and CompletionLock.
  • Avoid spurious CompletionCondition.notify_all() when ActiveThreads is greater than 0.
  • Use default member initializers.
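
In sketch form (simplified, assuming the member names used in this patch; see the diff for the exact code), the worker's completion path and wait() become:

// Worker, after running a task: update state under the single QueueLock and
// notify only when the pool may actually have become idle.
bool Notify;
{
  std::lock_guard<std::mutex> LockGuard(QueueLock);
  --ActiveThreads;
  Notify = !ActiveThreads && Tasks.empty();
}
if (Notify)
  CompletionCondition.notify_all();

// ThreadPool::wait() blocks on the same mutex, so the predicate check and the
// block are atomic with respect to the worker's update above.
void ThreadPool::wait() {
  std::unique_lock<std::mutex> LockGuard(QueueLock);
  CompletionCondition.wait(LockGuard,
                           [&] { return !ActiveThreads && Tasks.empty(); });
}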

Diff Detail

Event Timeline

MaskRay created this revision.Apr 24 2020, 10:07 PM
Herald added a project: Restricted Project.Apr 24 2020, 10:07 PM

Isn't this technically pessimizing some cases by sharing a lock (and so increasing potential contention here)?

llvm/lib/Support/ThreadPool.cpp
58

Can you expand this over different variables/statements? Or at a minimum add explicit parentheses? I am sure the compiler can parse it, but I can't :)

60–63

Update the comment here

MaskRay updated this revision to Diff 260066.Apr 24 2020, 10:47 PM
MaskRay marked 2 inline comments as done.

Address comments

MaskRay added a comment.EditedApr 24 2020, 10:52 PM

Isn't this technically pessimizing some cases by sharing a lock (and so increasing potential contention here)?

No. Only the thread(s) calling ThreadPool::wait() wait on CompletionCondition. Once such a waiter blocks, it will not wake up spuriously. Note that before this change we took locks 3 times in the loop body; now it is 2.
I believe this version is strictly better.

MaskRay updated this revision to Diff 260115.Apr 25 2020, 9:32 AM

Avoid spurious wake-up of waiter

aganea added inline comments.Apr 25 2020, 9:54 AM
llvm/lib/Support/ThreadPool.cpp
58

Is it worth generalizing the notify condition between this and ThreadPool::wait() below, to ease future maintenance/comprehension?

No. Only the thread(s) calling ThreadPool::wait() wait on CompletionCondition. Once such a waiter blocks, it will not wake up spuriously. Note that before this change we took locks 3 times in the loop body; now it is 2.

I wasn't worried about contention with the waiting threads, but between working threads: they are taking the QueueLock more often now (it was taken once or twice per iteration of the loop body before and is now taken an extra time).
I am not sure we'd be able to measure any difference here, just curious how you thought about this tradeoff!

Originally, I think I started writing all this code with ActiveThreads being an atomic, so that the working threads could increment/decrement it without taking any lock; however, when adding wait() I couldn't see how to avoid taking the CompletionLock, because of how std::condition_variable is set up.
That said, since the waiting thread takes the QueueLock with your change, would leaving ActiveThreads atomic allow us to not take the QueueLock on task completion?
I'm wondering about something like this:

// Notify task completion if this is the last active thread, in case
// someone waits on ThreadPool::wait().
bool Notify = --ActiveThreads == 0;
if (Notify)
  CompletionCondition.notify_all();
llvm/lib/Support/ThreadPool.cpp
58

Can you add parentheses or split the expression?

Notify = (--ActiveThreads == 0) && Tasks.empty();

or

--ActiveThreads;
Notify = !ActiveThreads && Tasks.empty();

The latter avoids the side effect on ActiveThreads in the expression, making the second line "pure" and easier to read.

MaskRay marked an inline comment as done.EditedApr 25 2020, 10:40 AM

No. Only the thread(s) calling ThreadPool::wait() wait on CompletionCondition. Once such a waiter blocks, it will not wake up spuriously. Note that before this change we took locks 3 times in the loop body; now it is 2.

I wasn't worried about contention with the waiting threads, but between working threads: they are taking the QueueLock more often now (it was taken once or twice per iteration of the loop body before and is now taken an extra time).
I am not sure we'd be able to measure any difference here, just curious how you thought about this tradeoff!

There are two stages, before and after the execution of a task. The original code sequence used different locks. I suppose this is what you meant by less contention: a worker in the first stage (holding QueueLock while not holding CompletionLock) cannot contend with a worker in the second stage (holding CompletionLock).

But note that:

  • there is still a window when a worker in the first stage can hold CompletionLock. The window is small, but it still exists.
  • the spurious signalling of a condition variable (CompletionCondition) can cause contention because a condition variable has an internal lock.
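
Roughly, the old per-iteration sequence was (a simplified sketch, eliding the queue wait and exit logic; names as in the current header):

// First stage: pick up a task.
{
  std::unique_lock<std::mutex> LockGuard(QueueLock);
  {
    std::unique_lock<std::mutex> LockGuard2(CompletionLock); // the small window above
    ++ActiveThreads;
  }
  Task = std::move(Tasks.front());
  Tasks.pop();
}
Task();
// Second stage: mark completion.
{
  std::unique_lock<std::mutex> LockGuard(CompletionLock);
  --ActiveThreads;
}
CompletionCondition.notify_all(); // unconditional, hence the spurious signalling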

Originally, I think I started writing all this code with ActiveThreads being an atomic, so that the working threads could increment/decrement it without taking any lock; however, when adding wait() I couldn't see how to avoid taking the CompletionLock, because of how std::condition_variable is set up.
That said, since the waiting thread takes the QueueLock with your change, would leaving ActiveThreads atomic allow us to not take the QueueLock on task completion?
I'm wondering about something like this:

// Notify task completion if this is the last active thread, in case
// someone waits on ThreadPool::wait().
bool Notify = --ActiveThreads == 0;
if (Notify)
  CompletionCondition.notify_all();

The suggested code sequence has the problem of lost signals. It happens because the state observed by a waiter can change between the waiter's predicate test and its blocking.

  worker: pop one item from Tasks and Tasks is now empty
waiter: lock QueueLock
waiter: test that `!ActiveThreads && Tasks.empty()` is false and decide to unlock and block
  worker: decrement ActiveThreads to zero
  worker: signal CompletionCondition  # the signal is lost
waiter: unlock QueueLock and block
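
The general rule, shown on a stripped-down self-contained example rather than the ThreadPool code: any state read by a condition variable's predicate must only change while the waiter's mutex is held; then a notification cannot fall into the gap between the waiter's check and its block.

#include <condition_variable>
#include <mutex>

std::mutex M;
std::condition_variable CV;
unsigned Active = 1; // stands in for ActiveThreads

void workerFinishes() {
  bool Notify;
  {
    std::lock_guard<std::mutex> Guard(M); // the state only changes under M
    Notify = --Active == 0;
  }
  if (Notify)
    CV.notify_all(); // not lost: the waiter either already saw Active == 0
                     // under M, or is blocked in wait() and will be woken
}

void waiter() {
  std::unique_lock<std::mutex> Lock(M);
  CV.wait(Lock, [] { return Active == 0; }); // predicate rechecked under M
}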

With an atomic ActiveThreads, an alternative code sequence is possible, but I'm not sure it is better:

bool Notify = --ActiveThreads == 0;
{
  std::lock_guard<std::mutex> guard(QueueLock);
  Notify &= Tasks.empty();
}
if (Notify) CompletionCondition.notify_all();
llvm/lib/Support/ThreadPool.cpp
58

Is it worth generalizing the notify condition between this and ThreadPool::wait() below, to ease future maintenance/comprehension?

Sorry that I am not following this suggestion. Can you elaborate?

MaskRay updated this revision to Diff 260125.Apr 25 2020, 10:52 AM

Avoid --ActiveThreads in a complex expression

aganea added inline comments.Apr 25 2020, 11:05 AM
llvm/lib/Support/ThreadPool.cpp
58

I meant a function, used both by the code here and by the code below in wait(), simply to signify (to users who read the code) that both places check exactly the same thing?

private: bool workCompletedNoLock() { return !ActiveThreads && Tasks.empty(); }
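
Both call sites would then read something like (a sketch, not necessarily the final wording):

// Worker thread, holding QueueLock after finishing a task:
Notify = workCompletedNoLock();

// ThreadPool::wait():
std::unique_lock<std::mutex> LockGuard(QueueLock);
CompletionCondition.wait(LockGuard, [&] { return workCompletedNoLock(); });
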
MaskRay updated this revision to Diff 260126.Apr 25 2020, 11:15 AM
MaskRay marked 2 inline comments as done.

Place !ActiveThreads && Tasks.empty() in a new member function

llvm/lib/Support/ThreadPool.cpp
58

Gotcha. Good idea!

No. Only the thread(s) calling ThreadPool::wait() wait on CompletionCondition. Once such a waiter blocks, it will not wake up spuriously. Note that before this change we took locks 3 times in the loop body; now it is 2.

I wasn't worried about contention with the waiting threads, but between working threads: they are taking the QueueLock more often now (it was taken once or twice per iteration of the loop body before and is now taken an extra time).
I am not sure we'd be able to measure any difference here, just curious how you thought about this tradeoff!

There are two stages, before and after the execution of a task. The original code sequence used different locks. I suppose this is what you meant by less contention: a worker in the first stage (holding QueueLock while not holding CompletionLock) cannot contend with a worker in the second stage (holding CompletionLock).

Yes, this is what I meant: the "second stage" lock now contends with the first-stage queue manipulation, while it didn't before. It also contends with any thread trying to enqueue.
Overall, for a given execution there will be quite a lot more (almost doubled?) locking/unlocking of QueueLock, right?

  • the spurious signalling of a condition variable (CompletionCondition) can cause contention because a condition variable has an internal lock.

We're getting into implementation/platform specifics, but I thought that signaling was mostly going to map to one or two futex syscalls?

The suggested code sequence has the problem of lost signals. It happens because a waiter thread can change the observed state between waiter's testing and blocking.
...

Thanks! This is likely the sequence that led me to add this lock at the time... IIRC you can avoid this with low-level use of futexes, but there isn't a portable solution.

Anyway, I'm fine with this change overall; I don't think this thread pool implementation is optimized for very small tasks anyway.

Anyway, I'm fine with this change overall; I don't think this thread pool implementation is optimized for very small tasks anyway.

May I get a LGTM? :)

mehdi_amini accepted this revision.Apr 27 2020, 8:31 PM
This revision is now accepted and ready to land.Apr 27 2020, 8:31 PM
This revision was automatically updated to reflect the committed changes.