We currently process one OutputSection at a time and, for each one, write its contained input sections in parallel. This strategy does not leverage multi-threading well. Instead, parallelize writes of different OutputSections.

The default TaskSize for parallelFor often leads to inferior sharding, so we prepare the tasks in the caller instead.
- Move llvm::parallel::detail::TaskGroup to llvm::parallel::TaskGroup.
- Add llvm::parallel::TaskGroup::execute.
- Change writeSections to declare a TaskGroup and pass it to writeTo (see the sketch below).
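
Roughly, the new structure looks like the following sketch. This is not the actual lld code: FakeInputSection, FakeOutputSection, the fixed shard count, and the memcpy layout are stand-ins for illustration; only llvm::parallel::TaskGroup and its new execute member come from this patch.

```cpp
#include "llvm/Support/Parallel.h"

#include <cstring>
#include <vector>

struct FakeInputSection {
  std::vector<char> data;
};

struct FakeOutputSection {
  size_t offset = 0;
  std::vector<FakeInputSection> members;

  // Analogue of OutputSection::writeTo: enqueue explicitly sized shards onto
  // the caller's TaskGroup instead of relying on parallelFor's default
  // TaskSize, so writes from different output sections can interleave.
  void writeTo(char *buf, llvm::parallel::TaskGroup &tg) {
    const size_t numShards = 4; // illustrative; lld sizes shards from the input
    for (size_t shard = 0; shard != numShards; ++shard)
      tg.execute([this, buf, shard, numShards] {
        size_t pos = 0;
        for (size_t i = 0; i != members.size(); ++i) {
          if (i % numShards == shard)
            std::memcpy(buf + pos, members[i].data.data(),
                        members[i].data.size());
          pos += members[i].data.size();
        }
      });
  }
};

// Analogue of writeSections: declare the TaskGroup here and pass it down so
// that different output sections are also written in parallel.
void writeSections(char *outBuf, std::vector<FakeOutputSection> &sections) {
  llvm::parallel::TaskGroup tg;
  for (FakeOutputSection &osec : sections)
    osec.writeTo(outBuf + osec.offset, tg);
  // The TaskGroup waits for all outstanding tasks when it is destroyed, so
  // the buffer is fully written by the time this function returns.
}
```

Sharing one TaskGroup across all sections keeps every shard in a single pool of tasks, which is what lets writes of different OutputSections overlap instead of being serialized per section.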
Speed-up with --threads=8:
- clang -DCMAKE_BUILD_TYPE=Release: 1.11x as fast
- clang -DCMAKE_BUILD_TYPE=Debug: 1.10x as fast
- chrome -DCMAKE_BUILD_TYPE=Release: 1.04x as fast
- scylladb build/release: 1.09x as fast
On M1, many benchmarks are a small fraction of a percent faster. Mozilla showed the largest difference, with the patch being about 1.03x as fast.
Given there is already a "live" TaskGroup, I don't think this will actually run in parallel, IIUC. That said, this is a limitation of the current parallel implementation rather than of this patch.
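
For reference, a hypothetical illustration of that limitation, assuming the nesting rule that only the first live TaskGroup dispatches work to the thread pool while any TaskGroup constructed while another is alive runs its tasks inline:

```cpp
// Hypothetical illustration only; the rule that a second live TaskGroup runs
// its tasks inline is an assumption about the current parallel implementation.
#include "llvm/Support/Parallel.h"

void demo() {
  llvm::parallel::TaskGroup outer; // first live group: may use the thread pool
  llvm::parallel::TaskGroup inner; // created while outer is alive
  inner.spawn([] { /* assumed to run inline on the calling thread */ });
  outer.spawn([] { /* assumed eligible to run on the thread pool */ });
}
```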