This is an archive of the discontinued LLVM Phabricator instance.

[Support/Parallel] Add a special case for 0/1 items to llvm::parallel_for_each.
ClosedPublic

Authored by lattner on May 1 2021, 2:07 PM.

Details

Summary

This avoids the non-trivial overhead of creating a TaskGroup in these degenerate
cases, but it also exposes more parallelism. It turns out that the default executor
underlying TaskGroup prevents recursive parallelism - while one task group
instance is alive, nested ones fall back to running serially.

This is a big issue for some MLIR dialects that have a single instance of an
outer op (e.g. a firrtl.circuit) containing many ops that could be processed in
parallel (e.g. firrtl.modules). This patch side-steps the problem by not
creating the TaskGroup when it isn't needed. See this issue for more details:
https://github.com/llvm/circt/issues/993

Note that this isn't a really great solution for the general case of nested
parallelism. A redesign of the TaskGroup stuff would be better, but would be
a much more invasive change.
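
For readers skimming the diff, the shape of the change is roughly as follows. This is a minimal sketch rather than the exact patch: runViaTaskGroup is a hypothetical stand-in (serial here only so the sketch compiles on its own) for the existing TaskGroup-based path in Support/Parallel.h.

```cpp
#include <algorithm>
#include <iterator>

// Hypothetical stand-in for the TaskGroup-based implementation; serial here
// only so the sketch is self-contained.
template <class IterTy, class FuncTy>
void runViaTaskGroup(IterTy Begin, IterTy End, FuncTy Fn) {
  std::for_each(Begin, End, Fn);
}

template <class IterTy, class FuncTy>
void parallel_for_each(IterTy Begin, IterTy End, FuncTy Fn) {
  if (Begin == End)
    return;                          // 0 items: nothing to do, no TaskGroup.
  if (std::next(Begin) == End) {     // 1 item: run it inline, no TaskGroup.
    Fn(*Begin);
    return;
  }
  runViaTaskGroup(Begin, End, Fn);   // >1 item: existing parallel path.
}
```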

Diff Detail

Event Timeline

lattner created this revision. May 1 2021, 2:07 PM
lattner requested review of this revision. May 1 2021, 2:07 PM

This seems obvious to me, but I'd appreciate a quick look from someone else.

lattner accepted this revision. May 3 2021, 10:07 AM

This is an obvious improvement, so I'm going to self-approve. TaskGroup could be improved significantly, but this fits within the current architecture.

This revision is now accepted and ready to land. May 3 2021, 10:07 AM

LGTM from me. Though I think it would be good to reduce some of the duplication in these methods.

rnk added a comment. May 4 2021, 10:49 AM

Change looks good.

In the long run, we should try to build on an existing C++17 parallel algorithms library, such as the pstl in our own monorepo. I don't know how far off it is before we can raise the LLVM toolchain requirements to C++17, and after that, how much platform support will hold us back, but maybe we can hack the pstl into the LLVM build if it becomes a sticking point.

The important thing for LLVM is that we express our code using well-known, deterministic parallel algorithms, and we can pick up better implementations later.
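
For concreteness, the same loop expressed through the standard C++17 parallel algorithms would look roughly like the sketch below; whether this is viable depends on the toolchain shipping <execution> (or on pulling the pstl from the monorepo into the build). The function name and the lambda body are placeholders.

```cpp
#include <algorithm>
#include <execution>
#include <vector>

void processAll(std::vector<int> &Items) {
  // Standard parallel algorithm; the library decides how to schedule work.
  std::for_each(std::execution::par, Items.begin(), Items.end(),
                [](int &X) { X *= 2; });
}
```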

How expensive is spinning up a TaskGroup? Mostly the ctor/dtor of TaskGroup?

I created D117510 to remove one Fn(...) call, which may make the instantiated code slightly smaller.

> Change looks good.
>
> In the long run, we should try to build on an existing C++17 parallel algorithms library, such as the pstl in our own monorepo. I don't know how far off it is before we can raise the LLVM toolchain requirements to C++17, and after that, how much platform support will hold us back, but maybe we can hack the pstl into the LLVM build if it becomes a sticking point.
>
> The important thing for LLVM is that we express our code using well-known, deterministic parallel algorithms, and we can pick up better implementations later.

Keeping LLVM's own version provides some flexibility. E.g. when lld is writing output sections in parallel, we may tune the function to use a better TaskSize.
I tested that for a -DCMAKE_BUILD_TYPE=Release clang link, a [[ https://gist.github.com/MaskRay/540e7bb31408afcee2b827140bef33e3 | fixed TaskSize==128 ]] is 1.02x as fast as the current code, though there is no noticeable difference when linking chrome.
Such a preference may not be expressible with the C++ STL.
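
To make the tuning knob concrete: a fixed TaskSize just means the index space is split into blocks of that many items, each handed to a worker. The sketch below spawns raw threads purely to stay self-contained; the real code would push the blocks onto LLVM's TaskGroup, and the names and the default of 128 are illustrative only.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Chunk [0, N) into blocks of TaskSize indices and run each block on a worker.
template <class FuncTy>
void forEachChunked(size_t N, FuncTy Fn, size_t TaskSize = 128) {
  std::vector<std::thread> Workers;
  for (size_t Begin = 0; Begin < N; Begin += TaskSize) {
    size_t End = std::min(N, Begin + TaskSize);
    Workers.emplace_back([=] {
      for (size_t I = Begin; I != End; ++I)
        Fn(I);
    });
  }
  for (std::thread &W : Workers)
    W.join();
}
```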

Another thing we need to think about is code bloat. Many template-heavy libraries expand to a huge amount of code. For some programs, like a linker, we call parallel* in many places, so the expanded code may be significant.
(I think mold has experienced this.)

Spinning up the threadpool is expensive, but the bigger issue is that Threading.h doesn't support reentrant concurrency. If you have a parallel for loop with one element, and that element does a parallel for loop over 10000 elements, the inner loop will run serially. :-(
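
The shape of that scenario, as a sketch (using the llvm::parallel_for_each spelling from this patch's title; the vectors and the lambda bodies are placeholders):

```cpp
#include "llvm/Support/Parallel.h"
#include <vector>

void nestedExample() {
  std::vector<int> Outer = {0};    // a single outer element (the "circuit")
  std::vector<int> Inner(10000);   // many inner elements (the "modules")
  llvm::parallel_for_each(Outer.begin(), Outer.end(), [&](int) {
    // With a TaskGroup alive for the outer loop, this inner loop ran
    // serially; with the 1-element special case, the outer call no longer
    // creates a TaskGroup, so the inner loop can actually go parallel.
    llvm::parallel_for_each(Inner.begin(), Inner.end(),
                            [](int &X) { X += 1; });
  });
}
```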