This is an archive of the discontinued LLVM Phabricator instance.

[Support][Parallel] Initialize threadIndex and add assertion checking its usage.
ClosedPublic

Authored by avl on Apr 21 2023, 4:58 AM.

Details

Summary

This patch adds a check that threadIndex is used only by threads
created by ThreadPoolExecutor. This helps catch two types of errors:

  1. If a thread is not created by ThreadPoolExecutor, its index may clash with the index of another thread. Using threadIndex in that case may lead to a data race.
  2. The index of the main thread (threadIndex == 0) currently clashes with the index of thread0 among the ThreadPoolExecutor threads. That may lead to a data race if the main thread and thread0 run concurrently.

This patch allows executing tasks on the main thread only when
parallel::strategy.ThreadsRequested == 1. In all other cases,
assertions check that threadIndex != UINT_MAX (i.e. that the task
is executed on a thread created by ThreadPoolExecutor).
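
As a rough illustration of that check (a sketch with assumed names and placement, not the actual diff):

// Every thread starts with a sentinel index. ThreadPoolExecutor overwrites it
// with a real index for each worker thread it creates, so any other thread
// still holding UINT_MAX can be detected.
thread_local unsigned threadIndex = UINT_MAX;

unsigned getThreadIndex() {
  assert(threadIndex != UINT_MAX &&
         "getThreadIndex() used on a thread not created by ThreadPoolExecutor");
  return threadIndex;
}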

This patch is a follow-up to the discussion in D142318.

Depends on D148728

Diff Detail

Event Timeline

avl created this revision.Apr 21 2023, 4:58 AM
avl requested review of this revision.Apr 21 2023, 4:58 AM
Herald added a project: Restricted Project.Apr 21 2023, 4:58 AM
avl edited the summary of this revision.Apr 21 2023, 5:27 AM
andrewng added inline comments.Apr 21 2023, 8:29 AM
llvm/lib/Support/Parallel.cpp
226

The following is the commit for which the NumItems <= 1 case was important: 5fa9d416 "[Support/Parallel] Add a special case for 0/1 items to llvm::parallel_for_each.". I don't know if it's still important, but it probably should be checked.

avl added inline comments.Apr 22 2023, 2:50 AM
llvm/lib/Support/Parallel.cpp
226

The following is the commit for which the NumItems <= 1 case was important: 5fa9d416 "[Support/Parallel] Add a special case for 0/1 items to llvm::parallel_for_each.". I don't know if it's still important, but it probably should be checked.

Thank you for the pointer!

That problem is still important. I would suggest using another workaround, though.

The original problem is that the parallelFor() functions do not support nested parallelism.
If nested TaskGroups were allowed to be parallel, a deadlock could happen:
when all threads are busy waiting for nested tasks, the task queue does not move.
The test case from https://github.com/llvm/circt/issues/993 looks like this:

parallelFor (Files) {
   parallelFor(Passes) {
   }
}

The nested loop over Passes is not parallel since nested parallelism is forbidden. But if "Files" contains only one file,
the workaround is to run the root parallelFor sequentially and the nested parallelFor in parallel, which gives a performance improvement.
That is why the NumItems workaround is necessary.
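
For context, the 0/1 special case under discussion boils down to something like this (a simplified sketch, not the exact code in Parallel.cpp):

void parallelFor(size_t Begin, size_t End,
                 llvm::function_ref<void(size_t)> Fn) {
  // Special case for 0/1 items: run directly on the calling thread, so a
  // parallel loop nested underneath is still allowed to run in parallel.
  if (End - Begin <= 1) {
    for (size_t I = Begin; I != End; ++I)
      Fn(I);
    return;
  }
  // ... otherwise split the range into tasks executed by ThreadPoolExecutor ...
}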

The drawbacks of the workaround:

  1. It helps only one concrete case.
  2. It makes using getThreadIndex() dangerous if parallelFor() is called from different threads: parallelFor() may run on the calling threads, and then the same threadIndex would be used concurrently.

The following workaround could probably solve the problem better:

std::function<void()> Fn = [&]() {
  parallelFor(Passes) {
  }
};

ThreadPool Pool;

for (Files)
  Pool.async(Fn);

Pool.wait();

It does not have the problem with using threadIndex, and it could also give more parallelism. Let's say we have two files:
the current solution will use only two threads (as the internal parallelFor is sequential), while the above solution would use all
available threads.
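
Spelled out a bit more concretely (File, Passes and runPass are placeholders for the caller's data, not existing APIs), the suggested approach might look like:

#include "llvm/Support/Parallel.h"
#include "llvm/Support/ThreadPool.h"

void processFiles(llvm::MutableArrayRef<File> Files) {
  llvm::ThreadPool Pool;
  for (size_t I = 0, E = Files.size(); I != E; ++I)
    Pool.async([&Files, I] {
      // The inner loop is a top-level parallelFor on a pool thread, so it is
      // not subject to the nested-parallelism restriction.
      llvm::parallelFor(0, Files[I].Passes.size(),
                        [&](size_t P) { runPass(Files[I], P); });
    });
  Pool.wait();
}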

Another solution would be to implement support for nested parallelism in TaskGroups, but that looks to be a much bigger task.

I will put this info into https://github.com/llvm/circt/issues/993

avl added inline comments.Apr 25 2023, 2:22 AM
llvm/lib/Support/Parallel.cpp
226

@lattner Chris, would you mind commenting on whether it would be OK to remove the "NumItems > 1" optimization and use ThreadPool for nested parallelFor instead, please?

I haven't looked in detail at this patch, but I can describe the problem that I was hitting. The situation was happening in MLIR and other clients that expose nested parallelism, e.g. in their parallel pass managers. The thread pool at the time refused to do nested cases in parallel, so if you did:

parallel for x in ...
  parallel for y in ...
     work

it would do the outer loop in parallel and then refuse to do the inner loop in parallel. This was sort of ok, but was really really bad when the outer loop actually had only a single entry. That would end up running the inner loop in series even though there was no parallelism in play.

The change to do 0/1 without "parallelism" led to a 50x speedup on a CIRCT workload. It also eliminates a bit of fixed overhead in those small cases, but that is a more minor issue.

-Chris

avl added a comment.EditedApr 25 2023, 8:58 AM

I haven't looked in detail at this patch, but I can describe the problem that I was hitting. The situation was happening in MLIR and other clients that expose nested parallelism, e.g. in their parallel pass managers. The thread pool at the time refused to do nested cases in parallel, so if you did:

parallel for x in ...
  parallel for y in ...
     work

it would do the outer loop in parallel and then refuse to do the inner loop in parallel. This was sort of ok, but was really really bad when the outer loop actually had only a single entry. That would end up running the inner loop in series even though there was no parallelism in play.

The change to do 0/1 without "parallelism" led to a 50x speedup on a CIRCT workload. It also eliminates a bit of fixed overhead in those small cases, but that is a more minor issue.

-Chris

I suggest using another approach to get more parallelism:

std::function<void()> Fn = [&]() {
  parallel for y in ...
};

ThreadPool Pool;

for x in ...
  Pool.async(Fn);

Pool.wait();

What do you think of it? It could give more benefit than the 0/1 approach.

avl added a comment.Apr 25 2023, 2:16 PM

Btw, looking at circt/llvm/mlir/include/mlir/IR/Threading.h, it seems that it does not use llvm::parallel::parallelFor and already uses a similar solution with a separate ThreadPool. Thus it looks like we can remove the "NumItems > 1" workaround in llvm/lib/Support/Parallel.cpp:

template <typename IteratorT, typename FuncT>
LogicalResult failableParallelForEach(MLIRContext *context, IteratorT begin,
                                      IteratorT end, FuncT &&func) {
  unsigned numElements = static_cast<unsigned>(std::distance(begin, end));
  if (numElements == 0)
    return success();

  // If multithreading is disabled or there is a small number of elements,
  // process the elements directly on this thread.
  if (!context->isMultithreadingEnabled() || numElements <= 1) {
    for (; begin != end; ++begin)
      if (failed(func(*begin)))
        return failure();
    return success();
  }

  // Build a wrapper processing function that properly initializes a parallel
  // diagnostic handler.
  ParallelDiagnosticHandler handler(context);
  std::atomic<unsigned> curIndex(0);
  std::atomic<bool> processingFailed(false);
  auto processFn = [&] {
    while (!processingFailed) {
      unsigned index = curIndex++;
      if (index >= numElements)
        break;
      handler.setOrderIDForThread(index);
      if (failed(func(*std::next(begin, index))))
        processingFailed = true;
      handler.eraseOrderIDForThread();
    }
  };

  // Otherwise, process the elements in parallel.
  llvm::ThreadPool &threadPool = context->getThreadPool();
  llvm::ThreadPoolTaskGroup tasksGroup(threadPool);
  size_t numActions = std::min(numElements, threadPool.getThreadCount());
  for (unsigned i = 0; i < numActions; ++i)
    tasksGroup.async(processFn);
  // If the current thread is a worker thread from the pool, then waiting for
  // the task group allows the current thread to also participate in processing
  // tasks from the group, which avoid any deadlock/starvation.
  tasksGroup.wait();
  return failure(processingFailed);
}

It isn't just the function pass manager:

~/Projects/circt> grep -i -r parallel lib (trimmed):
lib/Transforms/StripDebugInfoWithPred.cpp:  parallelForEach(
lib/Conversion/ExportVerilog/ExportVerilog.cpp:  parallelForEach(context, thingsToEmit, [&](StringOrOpToEmit &stringOrOp) {
lib/Conversion/ExportVerilog/ExportVerilog.cpp:  if (failed(failableParallelForEach(
lib/Conversion/ExportVerilog/ExportVerilog.cpp:  emitter.emitOps(list, output->os(), /*parallelize=*/false);
lib/Conversion/ExportVerilog/ExportVerilog.cpp:  // Emit each file in parallel if context enables it.
lib/Conversion/ExportVerilog/ExportVerilog.cpp:  parallelForEach(module->getContext(), emitter.files.begin(),
lib/Conversion/ExportVerilog/ExportVerilog.cpp:  if (failed(failableParallelForEach(
lib/Conversion/ExportVerilog/LegalizeNames.cpp:  mlir::parallelForEach(
lib/Conversion/FIRRTLToHW/LowerToHW.cpp:    memories = llvm::parallelTransformReduce(
lib/Conversion/FIRRTLToHW/LowerToHW.cpp:  auto result = mlir::failableParallelForEachN(
lib/Dialect/HW/InnerSymbolTable.cpp:  return mlir::failableParallelForEach(
lib/Dialect/HW/InnerSymbolTable.cpp:  return mlir::failableParallelForEach(
lib/Dialect/FIRRTL/Transforms/IMDeadCodeElim.cpp:  mlir::parallelForEach(circuit.getContext(),
lib/Dialect/FIRRTL/Transforms/IMConstProp.cpp:  mlir::parallelForEach(circuit.getContext(),
lib/Dialect/FIRRTL/Transforms/VBToBV.cpp:      failableParallelForEach(&getContext(), modules, [&](FModuleOp module) {
lib/Dialect/FIRRTL/Transforms/InnerSymbolDCE.cpp:  parallelForEach(&getContext(), modules,
lib/Dialect/FIRRTL/Transforms/LowerTypes.cpp:  auto result = failableParallelForEach(&getContext(), ops, lowerModules);
lib/Dialect/FIRRTL/Transforms/MemToRegOfVec.cpp:    mlir::parallelForEach(circtOp.getContext(), dutModuleSet,
lib/Dialect/FIRRTL/Import/FIRParser.cpp:  auto anyFailed = mlir::failableParallelForEachN(

similarly, the MLIR inliner, verifier, and pass manager have things like:

./mlir/lib/Pass/Pass.cpp:  if (failed(failableParallelForEach(context, opInfos, processFn)))
./mlir/lib/IR/Verifier.cpp:  if (failed(failableParallelForEach(
./mlir/lib/Transforms/Inliner.cpp:  return failableParallelForEach(ctx, nodesToVisit, [&](CallGraphNode *node) {
avl added a comment.Apr 26 2023, 2:12 AM

All the above places are calls to mlir::failableParallelForEach, right? llvm::parallel::parallelFor is not used. Thus, it looks like llvm::parallel::parallelFor may be changed?

All the above places are calls to mlir::failableParallelForEach, right? llvm::parallel::parallelFor is not used. Thus, it looks like llvm::parallel::parallelFor may be changed?

I haven't checked thoroughly, but it does look like MLIR has moved away from llvm::parallel, and its parallelFor implementation supports nested parallel for loops.

llvm/include/llvm/Support/Parallel.h
50–54

I tend to get a bad feeling when code is duplicated, as it is here and in Parallel.cpp. I know macros aren't great either, but perhaps define the "implementation" as a macro and use that here and in Parallel.cpp?

llvm/lib/Support/Parallel.cpp
249–250

IIUC, this will be executed if parallel::strategy.ThreadsRequested == 1 and in this case, getThreadIndex() will return UINT_MAX. Is this the intended behaviour?

avl added inline comments.Apr 26 2023, 9:26 AM
llvm/lib/Support/Parallel.cpp
249–250

No, that is not intended. Good catch! Thank you.

avl updated this revision to Diff 517314.Apr 26 2023, 2:12 PM

addressed comments (used macros for the getThreadIndex implementation,
set threadIndex to 0 when parallel::strategy.ThreadsRequested == 1)

andrewng added inline comments.Apr 28 2023, 7:34 AM
llvm/include/llvm/Support/Parallel.h
32–37

I think this can be moved within the #if LLVM_ENABLE_THREADS just below.

Perhaps change "the" -> "a" in the assert string.

llvm/lib/Support/Parallel.cpp
103

This setting of parallel::threadIndex = 0 (also in llvm::parallelFor) feels a little "odd". Just wondering if an alternative would be to check parallel::strategy.ThreadsRequested == 1 in getThreadIndex() and return 0 there?
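
Concretely, that alternative would make getThreadIndex() look roughly like this (a sketch, not the literal patch):

unsigned getThreadIndex() {
  // With only one thread requested, tasks legitimately run on the main
  // thread, which never gets a ThreadPoolExecutor-assigned index.
  if (parallel::strategy.ThreadsRequested == 1)
    return 0;
  assert(threadIndex != UINT_MAX &&
         "getThreadIndex() used on a thread not created by ThreadPoolExecutor");
  return threadIndex;
}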

avl updated this revision to Diff 518158.Apr 29 2023, 3:45 AM

addressed comments (added a check for parallel::strategy.ThreadsRequested == 1 into getThreadIndex)

andrewng accepted this revision.May 2 2023, 6:17 AM

LGTM.

This revision is now accepted and ready to land.May 2 2023, 6:17 AM
avl added a comment.May 2 2023, 8:46 AM

Thank you for the review!

It looks like this caused a failure on AIX; could you take a look, please?

https://lab.llvm.org/buildbot/#/builders/214/builds/7230/steps/6/logs/FAIL__LLVM-Unit__84

avl added a comment.May 3 2023, 6:51 AM

It looks like this caused a failure on AIX; could you take a look, please?

https://lab.llvm.org/buildbot/#/builders/214/builds/7230/steps/6/logs/FAIL__LLVM-Unit__84

Hi, thank you for reporting the issue. At first glance it looks unrelated, but I will check more.

https://lab.llvm.org/buildbot/#/builders/214/builds/7230/steps/6/logs/FAIL__LLVM-Unit__84

Hi, thank you for reporting the issue. At first glance it looks unrelated, but I will check more.

I suspect it's more likely because of D149173.

My mistake, I confirmed locally that it's D149173. Thanks.