This is an archive of the discontinued LLVM Phabricator instance.

[ELF] Parallelize writes of different OutputSections
ClosedPublic

Authored by MaskRay on Aug 5 2022, 1:24 AM.

Details

Summary

We currently process one OutputSection at a time and for each OutputSection
write contained input sections in parallel. This strategy does not leverage
multi-threading well. Instead, parallelize writes of different OutputSections.

The default TaskSize used by parallelFor often leads to inferior sharding, so we
prepare the tasks in the caller instead (a sketch follows the list below).

  • Move llvm::parallel::detail::TaskGroup to llvm::parallel::TaskGroup
  • Add llvm::parallel::TaskGroup::execute.
  • Change writeSections to declare TaskGroup and pass it to writeTo.
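A minimal sketch, not the exact patch, of the caller-side sharding: the caller picks a task size, groups consecutive indices into chunks, and spawns each chunk on a shared TaskGroup. The helper name asyncParallelFor and the chunking details here are illustrative.

#include <algorithm>
#include <cstddef>
#include "llvm/Support/Parallel.h"

template <class Fn>
static void asyncParallelFor(llvm::parallel::TaskGroup &tg, size_t taskSize,
                             size_t begin, size_t end, Fn fn) {
  for (size_t i = begin; i < end; i += taskSize) {
    size_t taskEnd = std::min(i + taskSize, end);
    // Each task handles [i, taskEnd); the TaskGroup runs tasks concurrently.
    tg.spawn([=] {
      for (size_t j = i; j < taskEnd; ++j)
        fn(j);
    });
  }
}

writeSections then creates a single TaskGroup and passes it down to each OutputSection::writeTo, so writes of different output sections share the same pool of tasks.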

Speed-up with --threads=8:

  • clang -DCMAKE_BUILD_TYPE=Release: 1.11x as fast
  • clang -DCMAKE_BUILD_TYPE=Debug: 1.10x as fast
  • chrome -DCMAKE_BUILD_TYPE=Release: 1.04x as fast
  • scylladb build/release: 1.09x as fast

On an M1, most benchmarks are only a small fraction of a percent faster. Mozilla showed the largest difference, with the patch being about 1.03x as fast.

Diff Detail

Event Timeline

MaskRay created this revision.Aug 5 2022, 1:24 AM
MaskRay requested review of this revision.Aug 5 2022, 1:24 AM
Herald added a project: Restricted Project.Aug 5 2022, 1:24 AM
MaskRay updated this revision to Diff 450458.Aug 5 2022, 6:27 PM
MaskRay retitled this revision from [WIP][ELF] Parallelize writes of different OutputSections to [ELF] Parallelize writes of different OutputSections.
MaskRay edited the summary of this revision. (Show Details)

The identified failures were actually all hidden bugs exposed by this patch.
All fixed now.

I am not really happy that the patch exposes some implementation details (i.e. TaskGroup, and the need for a TaskSize value) that would best stay hidden, but I honestly do not have a better idea ready. Maybe there could be a wrapper class that encloses a TaskGroup and into which asyncParallelFor could be moved as a method? Maybe it could even adjust TaskSize automatically based on parallel::detail::MaxTasksPerGroup and the count of already added tasks?

lld/ELF/OutputSections.cpp
459–467

I cannot find a requirement for BYTE() commands to override the content of input sections. Isn't the script simply malformed in that case? Can't we add these writes to the pool too?

476

Where does the value of 128 come from? It looks like, with the hardcoded value, large projects will benefit from the parallelization more than small ones. And it also does not take into account the llvm::parallel::detail::MaxTasksPerGroup; can it somehow be derived from the common setting?

I am not really happy that the patch exposes some implementation details (i.e. TaskGroup, and the need for a TaskSize value) that would best stay hidden, but I honestly do not have a better idea ready. Maybe there could be a wrapper class that encloses a TaskGroup and into which asyncParallelFor could be moved as a method? Maybe it could even adjust TaskSize automatically based on parallel::detail::MaxTasksPerGroup and the count of already added tasks?

TaskGroup is moved out of detail::. The idea is that its destructor is a suitable gather (join) point.
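A minimal usage sketch of that idea (the function name is made up; it only shows where the implicit join happens):

#include "llvm/Support/Parallel.h"

// Hypothetical example; writeEverything is not a real lld function.
void writeEverything() {
  {
    llvm::parallel::TaskGroup tg;
    tg.spawn([] { /* write output section A */ });
    tg.spawn([] { /* write output section B */ });
  } // ~TaskGroup() blocks here until every spawned task has finished.
}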

lld/ELF/OutputSections.cpp
459–467

We want to ensure that BYTE may overwrite a filler (fill(start, end - start, filler);). I can add a comment.

476

I don't think it can be derived from the common setting. It may not be bad to have a call-site-specific value, as with SmallVector. If we derived this automatically, tuning the library parameter might be more difficult.

I assume that the performance figures are based on running on Linux. Do you have any idea what the performance impact is on Windows?

lld/ELF/OutputSections.cpp
466

Given there is already a "live" TaskGroup, I don't think this will actually run in parallel, IIUC. However, this is a limitation of the current parallel implementation.

476

I would actually suggest that the task size could be even larger than 128 for output sections that have many input sections. I think as a "minimum" setting it feels about right. It's usually hard to derive a good task size value without incurring some cost in gathering appropriate metrics in order to make a good estimate.

Perhaps the task size could be calculated based on the number of sections and the number of threads in the task pool?
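Purely as an illustration of that suggestion (the constants 128 and 4 and the use of llvm::parallel::strategy are assumptions, not part of the patch):

#include <algorithm>
#include <cstddef>
#include "llvm/Support/Parallel.h"

// Illustrative heuristic only: aim for a few tasks per worker thread, but
// never shard below a minimum batch of sections.
static size_t pickTaskSize(size_t numSections) {
  size_t numThreads = llvm::parallel::strategy.compute_thread_count();
  return std::max<size_t>(128, numSections / std::max<size_t>(1, numThreads * 4));
}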

Not a lot to add over what has already been said. IIUC we do have code to detect when OutputSections overlap, and the user has to explicitly choose --noinhibit-exec to force the write? I think that non-deterministic output is reasonable in that case. Perhaps --noinhibit-exec could imply --threads=1 if we were concerned about people relying on the order of output.

MaskRay updated this revision to Diff 451767.Aug 11 2022, 1:37 AM
MaskRay marked 4 inline comments as done.

derive a better taskSize

Add comment about limitation of llvm/Support/Parallel.h.

Not a lot to add over what has already been said. IIUC we do have code to detect when OutputSections overlap, and the user has to explicitly choose --noinhibit-exec to force the write? I think that non-deterministic output is reasonable in that case. Perhaps --noinhibit-exec could imply --threads=1 if we were concerned about people relying on the order of output.

The --check-sections error almost assuredly suggests a fatal-style error. The user can still get an output with --noinhibit-exec. This is a fair corner case and I think letting it be non-deterministic is fine.
If we want to avoid that, we can change checkSections to always run and return a value indicating whether we should disable the async parallel write.

lld/ELF/OutputSections.cpp
476

Thanks. Changed to a simple heuristic.

I've managed to get some time to test this change on Windows and the results do not look good. Testing chrome from lld-speed-test.tar.xz, I get a ~1% improvement, so that's the good news. However, testing mozilla, I get a ~23% increase in link time, and testing scylla, I get a ~140% increase in link time. Testing an Unreal Engine 4 based link gives a ~21% increase in link time. This is running on a Windows 10 PC with an AMD Ryzen 3900X (12C/24T), 64GB RAM, and Samsung 970 EVO NVMe SSDs. I haven't had a chance to dig deeper into why this is having such a negative impact.

andrewng added inline comments.Aug 11 2022, 11:16 AM
lld/ELF/OutputSections.cpp
479

Didn't get much time to investigate the Windows performance degradation. However, lowering 256 to 16 appears to "fix" the issue for the test cases that I've tried so far. In all the "bad" cases, the performance is about the same or slightly better (1-3%). For a link of clang the improvement is ~9% for both values. Still don't really know the reasoning behind this behaviour.

MaskRay added a comment.EditedAug 11 2022, 11:27 AM

Appreciate the testing.

Some parallel* calls in SyntheticSections.cpp (e.g. GdbIndexSection::writeTo, MergeNoTailSection::writeTo) are now serial due to a limitation of llvm/Support/Parallel.h. Still, in all workloads I have tested, overlapping their writes with other output sections seems better than spending all threads parallelizing them while writing output sections serially.

In case it is useful: I tried parallel output section writes last year (https://reviews.llvm.org/D116282). I didn't find a strict improvement, so that patch stayed in the preview state.
The shouldParallel function there may be useful.

Also, --time-trace may be useful for analyzing synchronous operations. With asyncParallelFor, the "Write ..." time may be significantly smaller.

ld.lld --time-trace --threads=8 @response.txt -o 1
jq -r '.traceEvents[] | select(.name|contains("Write")) | "\(.dur/1000000) \(.name) \(.args)"' < 1.time-trace

Some parallel* calls in SyntheticSections.cpp (e.g. GdbIndexSection::writeTo, MergeNoTailSection::writeTo) are now serial due to a limitation of llvm/Support/Parallel.h. Still, in all workloads I have tested, overlapping their writes with other output sections seems better than spending all threads parallelizing them while writing output sections serially.

That is a shame. I think keeping the "parallel for" in the BYTE handling code is fine, I was just commenting that it wouldn't be parallel given the current circumstances. Hopefully, at some point the limitation in the parallel support can be improved.

Also, --time-trace may be useful for analyzing synchronous operations. With asyncParallelFor, the "Write ..." time may be significantly smaller.

Unfortunately, the --time-trace didn't really tell me anything I didn't already know, i.e. the writing of the output sections is taking significantly longer, everything else is about the same.

However, I have isolated it to the .debug sections. My "fast" test cases didn't have any debug info, whereas the "slow" test cases all had some debug info. Did you test with debug info? I think the key thing about debug info is that it's large and has relatively few input sections, which is why the minimum task size figure makes such a significant difference. This would lead me to suggest that having multiple threads concurrently writing out different large output sections does not scale well on Windows.

I've looked a bit more at the Windows performance degradation and have come up with the following code for taskSize and asyncParallelFor:

size_t tasks = size / (4 * 1024 * 1024);
size_t taskSize = tasks ? sections.size() / tasks : sections.size();
asyncParallelFor(tg, std::max<size_t>(1, taskSize), 0, sections.size(), fn);

The 4MB is somewhat arbitrary but this appears to work OK on my Windows PC and mostly eliminates the performance degradation that I've seen so far. In fact there's a ~3% improvement for mozilla from lld-speed-test.tar.xz.

I've also tried to test on Linux, although only with an Ubuntu 22.04.1 VM on my Windows PC. I seem to see a similar performance degradation for scylla and mozilla (and the UE4 based link too). @MaskRay, could you please try testing scylla and mozilla to see if you can reproduce the performance degradation? The above patch also improves the situation for my setup and actually results in performance improvements for the problematic test cases.

Not really too sure what the next steps should be for this review. Parallel optimisations of this nature are always going to be somewhat tricky across platforms.

Instead of approximating the taskSize based on a 4MB size limit, I've prototyped a patch in https://reviews.llvm.org/D132025 that does it a bit better. This prototype patch has only been briefly tested in the same configurations as I've already tested this patch. However, I have also been able to reproduce the performance degradation on a different Windows PC.

MaskRay updated this revision to Diff 454244.Aug 20 2022, 12:43 PM
MaskRay edited the summary of this revision. (Show Details)

Use TaskGroup::execute

MaskRay marked an inline comment as done.Aug 20 2022, 12:43 PM
MaskRay added a comment.EditedAug 20 2022, 12:49 PM

I've looked a bit more at the Windows performance degradation and have come up with the following code for taskSize and asyncParallelFor:

size_t tasks = size / (4 * 1024 * 1024);
size_t taskSize = tasks ? sections.size() / tasks : sections.size();
asyncParallelFor(tg, std::max<size_t>(1, taskSize), 0, sections.size(), fn);

The 4MB is somewhat arbitrary but this appears to work OK on my Windows PC and mostly eliminates the performance degradation that I've seen so far. In fact there's a ~3% improvement for mozilla from lld-speed-test.tar.xz.

I've also tried to test on Linux, although only with an Ubuntu 22.04.1 VM on my Windows PC. I seem to see a similar performance degradation for scylla and mozilla (and the UE4 based link too). @MaskRay, could you please try testing scylla and mozilla to see if you can reproduce the performance degradation? The above patch also improves the situation for my setup and actually results in performance improvements for the problematic test cases.

Not really too sure what the next steps should be for this review. Parallel optimisations of this nature are always going to be somewhat tricky across platforms.

I confirm the scylladb regression on a Debian testing machine using the previous sharding strategy. Thanks for suggesting the 4MiB task size strategy. I have switched to that strategy and it fixes the regression.

For the new Parallel API, I use std::function<void()> and place the implementation in the .cpp file to avoid template code-size bloat. I have tested that there is no observable slowdown compared with a templated FuncTy Fn.
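A rough illustration of that trade-off (the names here are hypothetical, not from the patch): a non-template function taking std::function<void()> can be defined once in Parallel.cpp, whereas a templated FuncTy body is re-instantiated in every caller.

#include <functional>

// Templated version: the body is instantiated (and duplicated) in every
// translation unit that calls it with a distinct callable type.
template <typename FuncTy> void executeTemplated(FuncTy fn) { fn(); }

// Type-erased version: declared in the header, defined once in a .cpp file,
// at the cost of one std::function indirection per task.
void executeErased(std::function<void()> fn);           // header declaration
void executeErased(std::function<void()> fn) { fn(); }  // single definition in the .cpp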


I do not know how to test a Firefox build.

curl https://hg.mozilla.org/mozilla-central/raw-file/default/python/mozboot/bin/bootstrap.py -O
python3 bootstrap.py
cd mozilla-unified
./mach build

gives me a 202MiB obj-x86_64-pc-linux-gnu/dist/bin/libxul.so. It is unclear how to rebuild it with -Wl,--reproduce=.

MaskRay edited the summary of this revision. (Show Details)Aug 20 2022, 1:34 PM

I've tested this latest patch applied to our downstream port on Windows and it improves linking performance in all our test cases that I've managed to test so far.

llvm/include/llvm/Support/Parallel.h
82

maybeSpawn -> execute.

llvm/lib/Support/Parallel.cpp
21–22

The patch doesn't compile with LLVM_ENABLE_THREADS=OFF. I think this guard needs to move down.

183–184

This guard needs to move up and I think TaskGroup::spawn above will need guards added too.

MaskRay updated this revision to Diff 454713.Aug 23 2022, 12:08 AM
MaskRay marked 3 inline comments as done.

fix !LLVM_ENABLE_THREADS build

LGTM but I think it would be good to get some other reviewers to take a look (and perhaps test) before approval.

MaskRay added a comment.EditedAug 23 2022, 4:11 PM

LGTM but I think it would be good to get some other reviewers to take a look (and perhaps test) before approval.

Thanks. Do you think a smaller task size like 1MiB / 2MiB would work fine?

Just to say, no objections so far. I don't have easy access to many targets that haven't already been benchmarked. One that I do have is an M1 MacBook Air; I should be able to cross-compile Clang for AArch64 ELF on that. Although not an ideal machine for benchmarking, it may give some info on how this performs on Apple Silicon.

On an M1 MacBook Air (8 GiB), using the lld speed test https://s3-us-west-2.amazonaws.com/linker-tests/lld-speed-test.tar.xz, I didn't see any significant differences with the patch. Both Chromium and Clang with GDB index were only a small fraction of a percent faster. Mozilla showed the largest difference, with the patch being about 3% faster. Overall, this doesn't make a lot of difference on an M1, which isn't an ideal target for benchmarking anyway: it doesn't have many threads, and some of the CPUs are efficiency cores. However, it does at least show that this patch isn't going to harm results either.

LGTM for the benefits on other targets.

Thanks. Do you think a smaller task size like 1MiB / 2MiB would work fine?

@MaskRay, I did briefly experiment with smaller task sizes, and both 1MiB and 2MiB showed similar results on my 2 test PCs with Windows. The performance-related issue on Windows occurred when the "task size" was too large, and 4MiB appeared to be the "sweet spot" in my testing. If 1MiB or 2MiB can be shown to be beneficial for other platforms, then it should be fine for Windows too (at least in my testing).

LGTM for the benefits on other targets.

@peter.smith, thanks for testing on Apple M1.

Thanks for all the reviews and testing. Will push this shortly.

MaskRay edited the summary of this revision. (Show Details)Aug 24 2022, 9:30 AM
This revision was not accepted when it landed; it landed in state Needs Review.Aug 24 2022, 9:40 AM
This revision was automatically updated to reflect the committed changes.