This is an archive of the discontinued LLVM Phabricator instance.

[Support] Attempt to fix deadlock in ThreadGroup
ClosedPublic

Authored by aganea on Sep 16 2021, 1:09 PM.

Details

Summary

This is an attempt to fix the situation described by https://reviews.llvm.org/D104207#2826290 and PR41508.

See sequence of operations leading to the bug in https://reviews.llvm.org/D104207#3004689

We ensure that the Latch is completely "free" before decrementing the number of TaskGroupInstances.

Diff Detail

Event Timeline

aganea created this revision.Sep 16 2021, 1:09 PM
aganea requested review of this revision.Sep 16 2021, 1:09 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 16 2021, 1:09 PM
mehdi_amini accepted this revision.Sep 16 2021, 2:21 PM
This revision is now accepted and ready to land.Sep 16 2021, 2:21 PM
MaskRay added inline comments.Sep 16 2021, 2:31 PM
llvm/lib/Support/Parallel.cpp
155

L is a member of TaskGroup. Hasn't its destructor already run sync?

mehdi_amini added inline comments.Sep 16 2021, 2:35 PM
llvm/lib/Support/Parallel.cpp
155

~TaskGroup() is invoked before destructing the individual members of the TaskGroup class.

MaskRay accepted this revision.Sep 16 2021, 2:43 PM

Hmm. ~Latch should be deleted, then.

aganea updated this revision to Diff 373237.Sep 17 2021, 8:46 AM
aganea edited the summary of this revision. (Show Details)

Removed the sync from ~Latch().
Removed the std::condition_variable::notify_all() improvement since it was occasionally breaking on Windows (and Linux too it seems).

Could you please take a second look?

I've removed the change to std::condition_variable::notify_all() because it was introducing a new issue, where the Latch object would be destroyed in-between the release of the mutex in Latch::dec() and the call to notify_all() in the same function. This is probably because the sleeping thread was awaken spuriously before notify_all() was even called, like mentioned here:

https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-sleepconditionvariablecs

Condition variables are subject to spurious wakeups (those not associated with an explicit wake) and stolen wakeups (another thread manages to run before the woken thread). Therefore, you should recheck a predicate (typically in a while loop) after a sleep operation returns.

This revision was automatically updated to reflect the committed changes.

Managed to end up here after looking at the query on D70447. Should this patch be cherry picked to the llvm-13 release branch?

Managed to end up here after looking at the query on D70447. Should this patch be cherry picked to the llvm-13 release branch?

Fine with me, I think you need to file a bug for @tstellar