This is an archive of the discontinued LLVM Phabricator instance.

[Verifier] Parallelize verification and dom checking. NFC.
ClosedPublic

Authored by lattner on Jun 13 2021, 9:42 PM.

Details

Summary

This changes the outer verification loop to not recurse into
IsolatedFromAbove operations - instead return them up to a place
where a parallel for loop can process them all in parallel. This
also changes Dominance checking to happen on IsolatedFromAbove
chunks of the region tree, which makes it easy to fold operation
and dominance verification into a single simple parallel regime.
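
Roughly, the new structure looks like the sketch below (simplified, not the literal Verifier.cpp code; verifyOpBody is an illustrative stand-in for the non-recursive per-op checks, while verifyOpAndDominance and llvm::parallelForEachN are the actual names that show up in the backtraces later in this review):

  #include <atomic>
  #include "mlir/IR/Operation.h"
  #include "llvm/ADT/SmallVector.h"
  #include "llvm/Support/Parallel.h"

  using namespace mlir;

  // Illustrative stand-in: verify `op` and everything nested under it, but
  // don't recurse into IsolatedFromAbove ops - collect them into
  // `isolatedOps` instead.
  static LogicalResult verifyOpBody(Operation &op,
                                    SmallVectorImpl<Operation *> &isolatedOps);

  static LogicalResult verifyOpAndDominance(Operation &op) {
    // Verify this chunk of the region tree (including dominance), stopping
    // at IsolatedFromAbove boundaries.
    SmallVector<Operation *> isolatedOps;
    if (failed(verifyOpBody(op, isolatedOps)))
      return failure();

    // Each IsolatedFromAbove op is an independent unit of work: verify them
    // (recursively, so nested isolated ops parallelize too) in parallel.
    std::atomic<bool> passed(true);
    llvm::parallelForEachN(0, isolatedOps.size(), [&](size_t idx) {
      if (failed(verifyOpAndDominance(*isolatedOps[idx])))
        passed = false;
    });
    return success(passed);
  }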

This speeds up firtool in CIRCT from ~40s to 31s on a large
testcase in -verify-each mode (the default). The .fir parser and
module passes in particular benefit from this - FModule passes
(roughly analogous to function passes) were already running the
verifier in parallel as part of the pass manager. This allows
the whole-module passes to verify their enclosed functions /
FModules in parallel.

-verify-each=false mode is still faster (26.3s on the same testcase),
but we do expect the verifier to take *some* time.

Diff Detail

Event Timeline

lattner created this revision. Jun 13 2021, 9:42 PM
lattner requested review of this revision. Jun 13 2021, 9:42 PM

I filed https://bugs.llvm.org/show_bug.cgi?id=50701 to track cases where the PassManager is verifying things it shouldn't IMO. Fixing that will be an orthogonal improvement to this patch.

rriddle accepted this revision. Jun 14 2021, 12:51 AM

LGTM, thanks!

mlir/lib/IR/Verifier.cpp
148

I don't think the getOperations here is necessary.

185–189

nit: Can you move this doc to the declaration instead?

233

nit: Spell out auto here.

235

Is the mlir:: here necessary?

236

op.getLoc()? I'm assuming this is what region.getLoc does internally.

343–345

Can we change this loop to use llvm::enumerate(op.getOperands()) ?
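
For reference, the suggested pattern looks roughly like this (an illustrative loop, not the actual one in Verifier.cpp):

  #include "llvm/ADT/STLExtras.h"
  #include "mlir/IR/Operation.h"

  static void checkOperands(mlir::Operation &op) {
    // llvm::enumerate yields (index, value) pairs, so the operand number is
    // available without maintaining a separate counter.
    for (auto en : llvm::enumerate(op.getOperands())) {
      unsigned operandNo = en.index();
      mlir::Value operand = en.value();
      (void)operandNo;
      (void)operand;
    }
  }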

This revision is now accepted and ready to land. Jun 14 2021, 12:51 AM
lattner marked 6 inline comments as done. Jun 14 2021, 9:26 AM

Thank you for the quick review!

mlir/lib/IR/Verifier.cpp
185–189

Sure.

235

Yes, but not for a good reason: Verifier defined its own emitError for a weird reason. Fixed.

343–345

yes, nicer!

lattner updated this revision to Diff 351897. Jun 14 2021, 9:26 AM
lattner marked 3 inline comments as done.

Incorporate River's feedback.

bondhugula added inline comments.
mlir/lib/Pass/Pass.cpp
390 ↗(On Diff #351897)

This comment can be updated.

lattner added inline comments. Jun 14 2021, 10:02 AM
mlir/lib/Pass/Pass.cpp
390 ↗(On Diff #351897)

Thank you for catching this. This wasn't intended to be in this patch. I'll remove it and put it into a separate patch!

lattner updated this revision to Diff 351907. Jun 14 2021, 10:02 AM

Remove the changes to Pass.cpp; they were supposed to be in a follow-on patch.

This revision was landed with ongoing or failed builds. Jun 14 2021, 10:03 AM
This revision was automatically updated to reflect the committed changes.

FYI we are once again seeing hangs triggered by this change. It appears to have something to do with LLVM thread pool cleanup. We've hit it in IREE's OSS python tests, Google's internal runs of MLIR's cuda integration tests, and IREE's iree-run-mlir tests (https://source.cloud.google.com/results/invocations/7ed7a557-2db6-4f7a-84e9-9e2dec6b1405/). So far we haven't been able to get a reproduction upstream, so it seems to require using MLIR in some particular environment.

This is just using llvm::parallelForEachN (not doing anything particularly fancy) so I can't imagine how it would be different than other similar things using it. It is possible this is exposing a lower level problem in LLVM threading.

In any case, let me know how I can help. I'd prefer not to revert this though, as it is a significant speedup.

> This is just using llvm::parallelForEachN (not doing anything particularly fancy) so I can't imagine how it would be different than other similar things using it. It is possible this is exposing a lower level problem in LLVM threading.
>
> In any case, let me know how I can help. I'd prefer not to revert this though, as it is a significant speedup.

I believe it is exposing such a lower-level problem, yeah. At least that's our guess, though it's a pretty prickly problem.

> This is just using llvm::parallelForEachN (not doing anything particularly fancy) so I can't imagine how it would be different than other similar things using it. It is possible this is exposing a lower level problem in LLVM threading.
>
> In any case, let me know how I can help. I'd prefer not to revert this though, as it is a significant speedup.

I am partially speculating (but I've also partially triaged this before on various platforms and don't think I'm wholly wrong). I suspect we're running up against something that will cause another paragraph to be written for the Executor *Executor::getDefaultExecutor() method in Parallel.cpp. With this patch, we are now getting intermittent hangs on exit/global destruct depending on which shared libraries have been loaded in the host process on Linux. I don't think there is anything wrong with this patch, per se, but it is quite consistently triggering some really nasty behavior across a variety of our projects and Linux platforms.

In general, I've found these kinds of global destructor thread shutdowns to be finicky at best, with odd interactions with respect to destruction order, platform, libraries loaded, etc. I don't have further information at this time, but at least speaking up for posterity seems appropriate.

I've got some pretty strong evidence that this is indeed deadlocking during verification processing, but I can't quite explain why.

I was able to capture this backtrace of all threads from a statically compiled iree-translate on Linux doing a very straight-forward sample program compile. We've got evidence of similar deadlocks in a number of other cases across platforms and compilers but have not had the right builds with debug info locally to verify the same deadlock (though it stands to reason):

Thread 3 (Thread 0x7f29fb119700 (LWP 4046230)):
#0  0x00007f29fbb369f4 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/grte/v4/lib64/libpthread.so.0
#1  0x000055e48a8a13b2 in std::__u::__libcpp_condvar_wait (__cv=0x7f29fb118794, __m=0x80) at third_party/llvm/llvm-project/libcxx/include/__threading_support:440
#2  std::__u::condition_variable::wait (this=0x7f29fb118794, lk=...) at third_party/llvm/llvm-project/libcxx/src/condition_variable.cpp:44
#3  0x000055e48a7c260b in llvm::parallel::detail::TaskGroup::~TaskGroup() () at third_party/crosstool/v18/stable/toolchain/bin/../include/c++/v1/__mutex_base:406
#4  0x000055e48a70dc16 in (anonymous namespace)::OperationVerifier::verifyOpAndDominance(mlir::Operation&) () at third_party/llvm/llvm-project/llvm/include/llvm/Support/Parallel.h:186
#5  0x000055e48a70faff in void std::__u::__function::__policy_invoker<void ()>::__call_impl<std::__u::__function::__default_alloc_func<llvm::parallel::detail::parallel_for_each_n<unsigned long, (anonymous namespace)::OperationVerifier::verifyOpAndDominance(mlir::Operation&)::$_0>(unsigned long, unsigned long, (anonymous namespace)::OperationVerifier::verifyOpAndDominance(mlir::Operation&)::$_0)::{lambda()#1}, void ()> >(std::__u::__function::__policy_storage const*) () at /proc/self/cwd/third_party/llvm/llvm-project/mlir/lib/IR/Verifier.cpp:105
#6  0x000055e48a7c49f1 in void std::__u::__function::__policy_invoker<void ()>::__call_impl<std::__u::__function::__default_alloc_func<llvm::parallel::detail::TaskGroup::spawn(std::__u::function<void ()>)::$_0, void ()> >(std::__u::__function::__policy_storage const*) () at third_party/crosstool/v18/stable/toolchain/bin/../include/c++/v1/functional:2230
#7  0x000055e48a7c37c4 in llvm::parallel::detail::(anonymous namespace)::ThreadPoolExecutor::work(llvm::ThreadPoolStrategy, unsigned int) () at third_party/crosstool/v18/stable/toolchain/bin/../include/c++/v1/functional:2230
#8  0x000055e48a7c3850 in void* std::__u::__thread_proxy<std::__u::tuple<std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct> >, llvm::parallel::detail::(anonymous namespace)::ThreadPoolExecutor::ThreadPoolExecutor(llvm::ThreadPoolStrategy)::{lambda()#1}::operator()() const::{lambda()#1}> >(llvm::parallel::detail::(anonymous namespace)::ThreadPoolExecutor::ThreadPoolExecutor(llvm::ThreadPoolStrategy)::{lambda()#1}::operator()() const::{lambda()#1}) () at /proc/self/cwd/third_party/llvm/llvm-project/llvm/lib/Support/Parallel.cpp:52
#9  0x00007f29fbb324e8 in start_thread () from /usr/grte/v4/lib64/libpthread.so.0
#10 0x00007f29fb9a722d in clone () from /usr/grte/v4/lib64/libc.so.6

Thread 2 (Thread 0x7f29fb91b700 (LWP 4046229)):
#0  0x00007f29fbb369f4 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/grte/v4/lib64/libpthread.so.0
#1  0x000055e48a8a13b2 in std::__u::__libcpp_condvar_wait (__cv=0x7f29fb91a744, __m=0x80) at third_party/llvm/llvm-project/libcxx/include/__threading_support:440
#2  std::__u::condition_variable::wait (this=0x7f29fb91a744, lk=...) at third_party/llvm/llvm-project/libcxx/src/condition_variable.cpp:44
#3  0x000055e48a7c260b in llvm::parallel::detail::TaskGroup::~TaskGroup() () at third_party/crosstool/v18/stable/toolchain/bin/../include/c++/v1/__mutex_base:406
#4  0x000055e48a70dc16 in (anonymous namespace)::OperationVerifier::verifyOpAndDominance(mlir::Operation&) () at third_party/llvm/llvm-project/llvm/include/llvm/Support/Parallel.h:186
#5  0x000055e48a70faff in void std::__u::__function::__policy_invoker<void ()>::__call_impl<std::__u::__function::__default_alloc_func<llvm::parallel::detail::parallel_for_each_n<unsigned long, (anonymous namespace)::OperationVerifier::verifyOpAndDominance(mlir::Operation&)::$_0>(unsigned long, unsigned long, (anonymous namespace)::OperationVerifier::verifyOpAndDominance(mlir::Operation&)::$_0)::{lambda()#1}, void ()> >(std::__u::__function::__policy_storage const*) () at /proc/self/cwd/third_party/llvm/llvm-project/mlir/lib/IR/Verifier.cpp:105
#6  0x000055e48a7c49f1 in void std::__u::__function::__policy_invoker<void ()>::__call_impl<std::__u::__function::__default_alloc_func<llvm::parallel::detail::TaskGroup::spawn(std::__u::function<void ()>)::$_0, void ()> >(std::__u::__function::__policy_storage const*) () at third_party/crosstool/v18/stable/toolchain/bin/../include/c++/v1/functional:2230
#7  0x000055e48a7c37c4 in llvm::parallel::detail::(anonymous namespace)::ThreadPoolExecutor::work(llvm::ThreadPoolStrategy, unsigned int) () at third_party/crosstool/v18/stable/toolchain/bin/../include/c++/v1/functional:2230
#8  0x000055e48a7c35ce in void* std::__u::__thread_proxy<std::__u::tuple<std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct> >, llvm::parallel::detail::(anonymous namespace)::ThreadPoolExecutor::ThreadPoolExecutor(llvm::ThreadPoolStrategy)::{lambda()#1}> >(std::__u::tuple<std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct> >, llvm::parallel::detail::(anonymous namespace)::ThreadPoolExecutor::ThreadPoolExecutor(llvm::ThreadPoolStrategy)::{lambda()#1}>) () at /proc/self/cwd/third_party/llvm/llvm-project/llvm/lib/Support/Parallel.cpp:57
#9  0x00007f29fbb324e8 in start_thread () from /usr/grte/v4/lib64/libpthread.so.0
#10 0x00007f29fb9a722d in clone () from /usr/grte/v4/lib64/libc.so.6

Thread 1 (Thread 0x7f29fb96dcc0 (LWP 4046189)):
#0  0x00007f29fbb369f4 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/grte/v4/lib64/libpthread.so.0
#1  0x000055e48a8a13b2 in std::__u::__libcpp_condvar_wait (__cv=0x7fffcdc78f74, __m=0x80) at third_party/llvm/llvm-project/libcxx/include/__threading_support:440
#2  std::__u::condition_variable::wait (this=0x7fffcdc78f74, lk=...) at third_party/llvm/llvm-project/libcxx/src/condition_variable.cpp:44
#3  0x000055e48a7c260b in llvm::parallel::detail::TaskGroup::~TaskGroup() () at third_party/crosstool/v18/stable/toolchain/bin/../include/c++/v1/__mutex_base:406
#4  0x000055e48a70dc16 in (anonymous namespace)::OperationVerifier::verifyOpAndDominance(mlir::Operation&) () at third_party/llvm/llvm-project/llvm/include/llvm/Support/Parallel.h:186
#5  0x000055e48a70d925 in mlir::verify(mlir::Operation*) () at /proc/self/cwd/third_party/llvm/llvm-project/mlir/lib/IR/Verifier.cpp:379
#6  0x000055e48a5d7cbe in mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) () at /proc/self/cwd/third_party/llvm/llvm-project/mlir/lib/Pass/Pass.cpp:409
#7  0x000055e48a5d80e6 in mlir::detail::OpToOpPassAdaptor::runPipeline(llvm::iterator_range<llvm::pointee_iterator<std::__u::unique_ptr<mlir::Pass, std::__u::default_delete<mlir::Pass> >*, mlir::Pass> >, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) () at /proc/self/cwd/third_party/llvm/llvm-project/mlir/lib/Pass/Pass.cpp:444
#8  0x000055e48a5d9c54 in mlir::PassManager::run(mlir::Operation*) () at /proc/self/cwd/third_party/llvm/llvm-project/mlir/lib/Pass/Pass.cpp:710
#9  0x000055e4892342e8 in mlir::iree_compiler::translateFromMLIRToVMBytecodeModuleWithFlags(mlir::ModuleOp, llvm::raw_ostream&) () at /proc/self/cwd/third_party/iree/iree/compiler/Translation/IREEVM.cpp:217
#10 0x000055e48a371c8c in mlir::LogicalResult std::__u::__function::__policy_invoker<mlir::LogicalResult (llvm::SourceMgr&, llvm::raw_ostream&, mlir::MLIRContext*)>::__call_impl<std::__u::__function::__default_alloc_func<mlir::TranslateFromMLIRRegistration::TranslateFromMLIRRegistration(llvm::StringRef, std::__u::function<mlir::LogicalResult (mlir::ModuleOp, llvm::raw_ostream&)> const&, std::__u::function<void (mlir::DialectRegistry&)>)::$_1, mlir::LogicalResult (llvm::SourceMgr&, llvm::raw_ostream&, mlir::MLIRContext*)> >(std::__u::__function::__policy_storage const*, llvm::SourceMgr&, llvm::raw_ostream&, mlir::MLIRContext*) () at third_party/crosstool/v18/stable/toolchain/bin/../include/c++/v1/functional:2230
#11 0x000055e4872b5ea2 in main () at third_party/crosstool/v18/stable/toolchain/bin/../include/c++/v1/functional:2230

I was not successful in getting a deadlock in debug mode, but I did get one with ASAN enabled, which revealed an additional clue in the stack trace:

#0  0x00007f9ad5a6a9f4 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/grte/v4/lib64/libpthread.so.0
#1  0x000056331a2e55da in std::__u::__libcpp_condvar_wait (__cv=0x7f9ad29ef454, __m=0x80) at third_party/llvm/llvm-project/libcxx/include/__threading_support:440
#2  std::__u::condition_variable::wait (this=0x7f9ad29ef454, lk=...) at third_party/llvm/llvm-project/libcxx/src/condition_variable.cpp:44
#3  0x0000563319efd4eb in llvm::parallel::detail::Latch::sync() const () at third_party/crosstool/v18/stable/toolchain/bin/../include/c++/v1/__mutex_base:406
#4  0x0000563319efce52 in llvm::parallel::detail::TaskGroup::~TaskGroup() () at third_party/llvm/llvm-project/llvm/include/llvm/Support/Parallel.h:43
#5  0x0000563319d4416d in (anonymous namespace)::OperationVerifier::verifyOpAndDominance(mlir::Operation&) () at third_party/llvm/llvm-project/llvm/include/llvm/Support/Parallel.h:186
#6  0x0000563319d43a2c in mlir::verify(mlir::Operation*) () at /proc/self/cwd/third_party/llvm/llvm-project/mlir/lib/IR/Verifier.cpp:379
#7  0x0000563319978e6f in mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) () at /proc/self/cwd/third_party/llvm/llvm-project/mlir/lib/Pass/Pass.cpp:409
#8  0x0000563319979dc1 in mlir::detail::OpToOpPassAdaptor::runPipeline(llvm::iterator_range<llvm::pointee_iterator<std::__u::unique_ptr<mlir::Pass, std::__u::default_delete<mlir::Pass> >*, mlir::Pass> >, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) () at /proc/self/cwd/third_party/llvm/llvm-project/mlir/lib/Pass/Pass.cpp:444
#9  0x000056331997eddf in mlir::PassManager::run(mlir::Operation*) () at /proc/self/cwd/third_party/llvm/llvm-project/mlir/lib/Pass/Pass.cpp:710
#10 0x000056331599da45 in mlir::iree_compiler::translateFromMLIRToVMBytecodeModuleWithFlags(mlir::ModuleOp, llvm::raw_ostream&) () at /proc/self/cwd/third_party/iree/iree/compiler/Translation/IREEVM.cpp:217
#11 0x00005633191f174b in mlir::LogicalResult std::__u::__function::__policy_invoker<mlir::LogicalResult (llvm::SourceMgr&, llvm::raw_ostream&, mlir::MLIRContext*)>::__call_impl<std::__u::__function::__default_alloc_func<mlir::TranslateFromMLIRRegistration::TranslateFromMLIRRegistration(llvm::StringRef, std::__u::function<mlir::LogicalResult (mlir::ModuleOp, llvm::raw_ostream&)> const&, std::__u::function<void (mlir::DialectRegistry&)>)::$_1, mlir::LogicalResult (llvm::SourceMgr&, llvm::raw_ostream&, mlir::MLIRContext*)> >(std::__u::__function::__policy_storage const*, llvm::SourceMgr&, llvm::raw_ostream&, mlir::MLIRContext*) () at third_party/crosstool/v18/stable/toolchain/bin/../include/c++/v1/functional:2230
#12 0x000056330f467f9e in main::$_0::operator()(std::__u::unique_ptr<llvm::MemoryBuffer, std::__u::default_delete<llvm::MemoryBuffer> >, llvm::raw_ostream&) const ()
    at third_party/crosstool/v18/stable/toolchain/bin/../include/c++/v1/functional:2230
#13 0x000056330f46706d in main () at /proc/self/cwd/third_party/iree/iree/tools/iree-translate-main.cc:133

Basically the same except that this one captures the offending frame: everything is deadlocking on the Latch::sync() of ~TaskGroup().

A couple of others and I spent about an hour piecing apart the Parallel.h/.cpp code and couldn't spot the race, but I believe there is an issue here that the verifier parallelism is tickling. Probably latent and just triggering based on workload ergonomics? In any case, I think we should disable parallelism in this patch for now. It's a great speedup, but not at the cost of stability.

I've sent out https://reviews.llvm.org/D104516 as a potential future direction for the threading in MLIR. It adds a ThreadPool to MLIRContext that can be used to provide more consistent thread usage, and I believe it provides a few other advantages as well (such as avoiding any potential static destruction issues).

For a while we only had repros using Blaze or Bazel. This is a pretty thorny bug to pin down. I managed to get it to hang using an OSS build of IREE with CMake though. Here's the backtrace for all threads: https://gist.github.com/GMNGeoffrey/0bdf40644efcda68db0c0ace734273b3

Ok, well I'm still not clear what is going on here - if the static dtor for the thread pool is running while there is still MLIR stuff going on, then all sorts of bad things are going to come unraveled. However, I'm totally ok with River's patch to add a thread pool to MLIRContext.

> Ok, well I'm still not clear what is going on here - if the static dtor for the thread pool is running while there is still MLIR stuff going on, then all sorts of bad things are going to come unraveled. However, I'm totally ok with River's patch to add a thread pool to MLIRContext.

Yeah, I don't know yet either, despite a lot of time staring at it. But there is definitely something going on across a couple of projects (including just core mlir tests) at a low rate, which shows up for us given the high rate of test runs. Given the low frequency, it is hard to repro. But I'm also +1 on River's patch and would like to see that go in.

Have you tried shoving the global executor into a ManagedStatic?

> Have you tried shoving the global executor into a ManagedStatic?

Not the executor itself. The ThreadPool is in a ManagedStatic. Most of the triage last week assumed this was a shutdown hang, and it was while reproducing that (i.e. doing an explicit llvm_shutdown of ManagedStatics) that Geoffrey and I both discovered that the hang/deadlock is actually happening during verification, not at shutdown, and we finally got the backtraces from a process that wasn't doing anything "funny" with threads or global state (i.e. a stock *-translate executable). It took quite a bit of troubleshooting to get it that far because it only seems to happen in optimized builds, and most of our bots that test those at a high enough frequency were not building with enough symbols to debug much. (We have also had exit-time issues on some systems, and I think this was incorrectly pattern-matched to that early on.)

I don't have a theory on the root cause right now, despite having looked at this quite a bit. River's patch does not appear to cause a hang. I don't think that is because it pulls the thread pool up to the context level (which is a good change on its own, imo), but rather because something else is being done differently. We may be leaving a race latent in the old path -- I just don't know at this point.
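
For reference, the ManagedStatic arrangement discussed above looks roughly like the following (shown with a plain llvm::ThreadPool for illustration; the executor behind parallelForEachN lives in Parallel.cpp and is managed separately):

  #include "llvm/Support/ManagedStatic.h"
  #include "llvm/Support/ThreadPool.h"

  // Lazily constructed on first use; destroyed by llvm_shutdown() at a
  // well-defined point rather than during unordered global destruction.
  static llvm::ManagedStatic<llvm::ThreadPool> gThreadPool;

  static void doSomeWork() {
    gThreadPool->async([] { /* ...work... */ });
    gThreadPool->wait();
  }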

The backtrace Stella posted above (https://reviews.llvm.org/D104207#2826290) is an example of a hang that isn't during shutdown.

It is really puzzling because it looks like multiple threads managed to get their TaskGroup with parallelism enabled, which isn't supposed to be possible. Many of us stared at the code for a while but we haven't made progress! (it really does not help that this is hard to reproduce...)

The problem with pulling this into an MLIRContext is that parallelism isn't specific to MLIR. It is specific to the machine that is being run on. It's not like MLIR gets some cpus and (LLVM or higher-level SW) gets others.

-Chris

> The problem with pulling this into an MLIRContext is that parallelism isn't specific to MLIR. It is specific to the machine that is being run on. It's not like MLIR gets some cpus and (LLVM or higher-level SW) gets others.
>
> -Chris

Agreed - and this applies even more so to the current state of having it as an LLVM-level static. Ultimately, I'm fine with APIs like MLIRContext that default to taking control of their own threading context, but any system that is actually trying to optimally manage its threading environment needs an ability to inject/control how threading is done (sharing thread pools, limiting parallelism, etc). Not tying the ThreadPool/global executor to process-level statics seems like a good step in that direction (I haven't looked at River's patch in detail to determine how much it boxes us out of more configurability down the road). I would assume that follow-on APIs would allow the creation of MLIRContext instances with more control over their threading resources.

> The problem with pulling this into an MLIRContext is that parallelism isn't specific to MLIR. It is specific to the machine that is being run on. It's not like MLIR gets some cpus and (LLVM or higher-level SW) gets others.
>
> -Chris

We already have the situation where MLIR doesn't share thread pools with higher-level SW, because that higher-level SW doesn't necessarily try to use the same threading APIs (well, technically right now even if it did it wouldn't get multi-threaded execution, but that is more of a current implementation problem). I know several downstream users where this is the case. If we want to share thread pools with higher-level SW, we likely should have a virtual thread pool interface that gets used, so that users can hook in their own implementations (see the sketch after this comment). If we go that route, I don't see much difference between static and non-static, given that MLIR's usages of threading already require access to the context (to see if threading is disabled, handle diagnostics, etc.). The major benefits of non-static are that it is much, much easier to understand/debug issues, and it provides much greater control over the thread pool used.

Outside of the current verifier incident, I've also run into this same issue when multi-threading the inliner. It was equally painful to debug, extremely hard to reliably reproduce, and I was only barely able to massage things enough so that it disappeared.
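
As a rough sketch, the kind of injectable interface mentioned above might look like this (none of these names exist in MLIR; they only illustrate the shape of letting an embedder supply its own scheduler):

  #include <cstddef>
  #include <functional>

  // Hypothetical interface; nothing like this exists in MLIR today.
  class ThreadingInterface {
  public:
    virtual ~ThreadingInterface() = default;
    // Run fn(i) for every i in [0, n), possibly in parallel.
    virtual void parallelFor(std::size_t n,
                             const std::function<void(std::size_t)> &fn) = 0;
  };

  // Fallback an embedder could install to disable multi-threading: run
  // everything inline on the calling thread.
  class InlineThreading : public ThreadingInterface {
  public:
    void parallelFor(std::size_t n,
                     const std::function<void(std::size_t)> &fn) override {
      for (std::size_t i = 0; i < n; ++i)
        fn(i);
    }
  };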

I think D61115 does not completely solve the issue. The issue you're seeing in https://reviews.llvm.org/D104207#2826290 is essentially PR41508.
The sequence that leads to the bug seems to be:

  1. main() calls parallelForEachN with, say, only one element to be processed by the TaskGroup.
  2. Since it's the first active TaskGroup, TaskGroupInstances is incremented to 1 and the TaskGroup::Parallel flag is set.
  3. TaskGroup::spawn() is called and since we're parallel we construct the ThreadPoolExecutor and push the work on its thread stack.
  4. parallelForEachN goes out of scope and calls the destructor for TaskGroup.
  5. TaskGroupInstances is decremented to 0.
  6. The destructor of Latch is called and calls sync() which suspends the main thread.
  7. The work that was scheduled in 3. is finally executed on a ThreadPoolExecutor thread, so another parallelForEachN is called.
  8. A new TaskGroup is constructed on the stack but since TaskGroupInstances is 0, we start again from 2. (incrementing it to 1, then we set Parallel to true and we schedule tasks on the ThreadPoolExecutor)
  9. In your case, steps 3-8 happen several times.

At this point we can end up in a situation where all of the ThreadPoolExecutor's threads are waiting in Latch::sync() and any potential task that could have unblocked them cannot be run.
This creates the situation in PR41508 or what you're seeing above.
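
For concreteness, the triggering pattern is a parallel loop whose body starts another parallel loop - which is what the recursive verifier does for nested IsolatedFromAbove ops. An illustrative standalone version (whether it actually hangs depends on timing and on the TaskGroupInstances accounting in steps 2, 5 and 8):

  #include <cstdio>
  #include "llvm/Support/Parallel.h"

  int main() {
    llvm::parallelForEachN(0, 4, [](size_t outer) {
      // The nested call constructs a second TaskGroup. If the outer
      // TaskGroup's destructor has already decremented TaskGroupInstances
      // (step 5) while this task is still pending (step 7), the nested
      // group also runs in parallel mode (step 8), and every thread can
      // end up blocked in Latch::sync().
      llvm::parallelForEachN(0, 4, [outer](size_t inner) {
        std::printf("outer %zu inner %zu\n", outer, inner);
      });
    });
    return 0;
  }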

Probably changing to something like:
TaskGroup::~TaskGroup() { L.sync(); --TaskGroupInstances; }
would fix it, but this has to be confirmed by someone who can repro the bug in the first place.
Also, calling Cond.notify_all() inside the lock isn't optimal; it could be done outside of the lock.

What do you think?