This is an archive of the discontinued LLVM Phabricator instance.

[ELF] Parallelize relocation scanning
ClosedPublic

Authored by MaskRay on Aug 31 2022, 1:38 AM.

Details

Summary
  • Change Symbol::flags to a std::atomic<uint16_t>
  • Add llvm::parallel::threadIndex as a thread-local non-negative integer
  • Add relocsVec to part.relaDyn and part.relrDyn so that relative relocations can be added without a mutex
  • Arbitrarily change -z nocombreloc to move relative relocations to the end. Disable parallelism for deterministic output.

MIPS and PPC64 use global states for relocation scanning. Keep serial scanning.

Speed-up with mimalloc and --threads=8 on an Intel Skylake machine:

  • clang (Release): 1.27x as fast
  • clang (Debug): 1.06x as fast
  • chrome (default): 1.05x as fast
  • scylladb (default): 1.04x as fast

Speed-up with glibc malloc and --threads=16 on a ThunderX2 (AArch64):

  • clang (Release): 1.31x as fast
  • scylladb (default): 1.06x as fast

Diff Detail

Event Timeline

MaskRay created this revision.Aug 31 2022, 1:38 AM
MaskRay requested review of this revision.Aug 31 2022, 1:38 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 31 2022, 1:38 AM
ikudrin added inline comments.Aug 31 2022, 11:45 AM
lld/ELF/Relocations.cpp
300–301

Why not define a copy constructor?

lld/ELF/Symbols.h
299–300

Why is not needsTlsGdToIe moved under atomic like needsTlsGd and alike?

315

You use it with two flags at least once, maybe call it setFlags?

MaskRay added inline comments.Aug 31 2022, 12:24 PM
lld/ELF/Symbols.h
299–300

All the 8 bits of std::atomic<uint8_t> have been used. We need one not in atomic if we want to keep the size of SymbolUnion unchanged.

ikudrin added inline comments.Sep 1 2022, 2:10 AM
lld/ELF/Symbols.h
299–300

Does that mean that some flags in the atomic do not really need to be handled as such, or that this flag is left outside despite it can be potentially updated concurrently, but there is no space for it in flags? In any case, that is worth documenting, at least.

MaskRay updated this revision to Diff 457877.Sep 4 2022, 6:23 PM
MaskRay marked 4 inline comments as done.
MaskRay edited the summary of this revision. (Show Details)

rebase. address comments

lld/ELF/Relocations.cpp
300–301

Good idea. Adopted

lld/ELF/Symbols.h
299–300

I have replaced Symbol::visibility with Symbol::stOther and atomic<uint16_t> is fine now, but I suspect 16-bit atomic operations are not efficient on common architectures.

MaskRay retitled this revision from [WIP][ELF] Parallelize relocation scanning to [ELF] Parallelize relocation scanning.Sep 4 2022, 6:29 PM
MaskRay updated this revision to Diff 457889.Sep 4 2022, 11:34 PM
MaskRay edited the summary of this revision. (Show Details)

rebase

lkail added a subscriber: lkail.Sep 4 2022, 11:53 PM
ikudrin added inline comments.Sep 5 2022, 7:11 AM
lld/ELF/Relocations.cpp
1241

If GotSection::hasGotOffRel and GotPltSection::hasGotPltOffRel are converted to atomic<bool>, the same should be done for Configuration::needsTlsLd because their usage pattern is similar.

1296

Shouldn't relocMutex be locked before this call?

1575

uint8_t -> uint16_t; not that it changes anything because the only flag that exceeds the range is NEEDS_TLSIE which is not used here, but still.

Sorry, I've been busy, so have only just had some time to look at this patch. Looks promising but unfortunately there are performance regressions on Windows for both chrome (~3%) and mozilla (~5%) from lld-speed-test.tar.xz. Don't yet know the reason for the slow down but I suspect it will be related to the "size" of the tasks being spawned in parallel.

Don't yet know the reason for the slow down but I suspect it will be related to the "size" of the tasks being spawned in parallel.

Had some time to investigate a bit more and it seems that the slow down, at least on my 12C/24T Windows PC, is actually a result of contention over relocMutex in RelocationScanner::processAux. So "too many" concurrent threads running RelocationScanner::processAux can result in an overall slow down to scan the relocations and in these cases, it's likely to slow down even further with more available threads. Unfortunately, there's no mechanism in parallel::TaskGroup to limit the number of concurrent tasks being run by the pool from the group, so there's no "easy" solution. I've been experimenting with some ideas that shard the input sections such that there are fewer concurrent threads running the relocation scanning code.

MaskRay marked 3 inline comments as done.EditedSep 6 2022, 6:02 PM

>>! In D133003#3772446, @andrewng wrote:

Don't yet know the reason for the slow down but I suspect it will be related to the "size" of the tasks being spawned in parallel.

Had some time to investigate a bit more and it seems that the slow down, at least on my 12C/24T Windows PC, is actually a result of contention over relocMutex in RelocationScanner::processAux. So "too many" concurrent threads running RelocationScanner::processAux can result in an overall slow down to scan the relocations and in these cases, it's likely to slow down even further with more available threads. Unfortunately, there's no mechanism in parallel::TaskGroup to limit the number of concurrent tasks being run by the pool from the group, so there's no "easy" solution. I've been experimenting with some ideas that shard the input sections such that there are fewer concurrent threads running the relocation scanning code.

Thanks for catching the issue. Perhaps we can add a thread_local thread index (for getDefaultExecutor) to llvm/Support/Parallel.h and allocate a relocation vector for each thread. Finally merge and sort the relocation vectors.

lld/ELF/Relocations.cpp
1575

Thanks for catching this!

andrewng added a comment.EditedSep 7 2022, 9:34 AM

I've created D133431 which is the result of my experimentation thus far. In my testing, it does slightly improve performance in the test cases that regressed in performance. In other test cases, it's around the same or slightly lower performance increase. However, I can't help but feel there should be a "better" solution. Although, I guess you've always got to balance that with complexity/maintainability. The hard coded concurrency limit of 8 tasks in D133431 also doesn't feel great.

Thanks for catching the issue. Perhaps we can add a thread_local thread index (for getDefaultExecutor) to llvm/Support/Parallel.h and allocate a relocation vector for each thread. Finally merge and sort the relocation vectors.

Yes, trying to eliminate the lock contention does sound like a good approach, although it feels like it would add complexity.

Also forgot to mention that there were 2 other ELF tests that seemed to need the --threads=1 treatment: comdat-discarded-error.s and debug-line-obj.s (although this might be due to the change in D133431).

MaskRay updated this revision to Diff 458972.EditedSep 9 2022, 12:44 AM
MaskRay marked an inline comment as done.
MaskRay edited the summary of this revision. (Show Details)

Remove mutex for relative relocations. Thanks to @andrewng for finding the issue

If the NEEDS_* change looks good, I'll pre-commit it (without using std::atomic) to reduce diff for future updates.

If the NEEDS_* change looks good, I'll pre-commit it (without using std::atomic) to reduce diff for future updates.

The NEEDS_* change LGTM.

This approach definitely looks better and hasn't added too much complexity. Initial testing on Windows is looking good, but I need to do a bit more.

lld/ELF/Relocations.cpp
1559

I wonder if it might be worthwhile using the previous code for the serial case? Although, it probably doesn't make a big difference to performance.

lld/ELF/Symbols.h
318

The argument name implies a single bit but perhaps add an assert, e.g. assert((bit & (bit - 1)) == 0)?

lld/ELF/SyntheticSections.h
546–547

Typo: should will -> will?

Is it worth adding the same comment to relocsVec in class RelrBaseSection?

llvm/lib/Support/Parallel.cpp
21

Perhaps int -> unsigned?

53

Perhaps move this initialisation of threadIndex and the one below into work()?

Performance on Windows looks good! Every test case I've tried has shown an improvement.

lld/ELF/SyntheticSections.cpp
1603–1606

Perhaps const auto &v? Same for RelrBaseSection::mergeRels().

MaskRay updated this revision to Diff 459206.Sep 9 2022, 2:39 PM
MaskRay marked 4 inline comments as done.

Thanks a lot for the comments.
Updated.

MaskRay added inline comments.Sep 9 2022, 2:46 PM
lld/ELF/Relocations.cpp
1543

I'll remove AndroidPackedRelocationSection does not support parallelism. . It works with deterministic parallelism.

1559

Use which piece of code for the serial case?

MaskRay edited the summary of this revision. (Show Details)Sep 9 2022, 11:45 PM
andrewng added inline comments.Sep 10 2022, 11:23 AM
lld/ELF/Relocations.cpp
1559

I was thinking this:

for (InputSectionBase *sec : inputSections)
  if (sec->isLive() && (sec->flags & SHF_ALLOC))
    scanner.template scanSection<ELFT>(*sec);

But on the other hand, in terms of future development and maintenance, it's probably better to use as much of the same code for both "paths", even if there's a minor performance penalty for the serial one.

1566–1571

This is running on the main thread. Is there a chance that this might clash with thread 0 of the task pool?

MaskRay updated this revision to Diff 459312.Sep 10 2022, 3:31 PM
MaskRay marked 2 inline comments as done.
MaskRay edited the summary of this revision. (Show Details)

reduce contention

lld/ELF/Relocations.cpp
1559

Yes, using the same code for both paths is better for maintenance.

1566–1571

Thanks for catching this. The main thread doing heavy work will contend with the thread pool. Changed to use tg.execute.

andrewng added inline comments.Sep 11 2022, 6:10 AM
lld/ELF/Relocations.cpp
1559

Yes, I think I agree. If it only affected the single threaded case, I wouldn't have mentioned it. But as there are specific configurations that are limited to serial I thought that it might be worth considering.

1566–1571

I think the previous code could have actually caused a threading issue, i.e. concurrent updates to the 0 indexed relocation vector. This will ensure that can't happen.

The only minor thing is it looks a little odd that the "serial" case uses tg but I guess it is still serial.

MaskRay marked an inline comment as done.Sep 11 2022, 11:00 AM
MaskRay added inline comments.
lld/ELF/Relocations.cpp
1559

Add the comment before the tg.execute([] { line?

+  // Both the main thread and thread pool index 0 use threadIndex==0.  Be
+  // careful that they don't concurrently run scanSections. When serial is
+  // true, fn() has finished at this point, so running execute is safe
andrewng accepted this revision.Sep 12 2022, 1:46 AM

This LGTM now, but I think it would be good to get another opinion too.

lld/ELF/Relocations.cpp
1559

Yes, I think it would be worth adding the comment for clarity.

This revision is now accepted and ready to land.Sep 12 2022, 1:46 AM

No objections from me. Some small suggestions for comments and a way that might catch someone using the relocs array before mergeRels has been called.

lld/ELF/SyntheticSections.h
506

Would it be better to move the text into the /// comment as it is a precondition for calling the function?

545–548

Now that mergeRels has to be called before this is useable, is it worth making this private with an interface that asserts mergeRels has been called?

546

Suggest "// will be moved into relocs by mergeRels()."

MaskRay updated this revision to Diff 459518.Sep 12 2022, 10:46 AM
MaskRay marked 5 inline comments as done.

Add concurrency to constructors and make RelocationBaseSection::relocsVec protected

MaskRay marked 3 inline comments as done.Sep 12 2022, 10:46 AM
MaskRay edited the summary of this revision. (Show Details)Sep 12 2022, 12:50 PM
This revision was automatically updated to reflect the committed changes.

Unfortunately, this commit broke mingw dylib builds with Windows native TLS. The reason for this is that with Windows native TLS, you can't directly access a TLS variable residing in a different DLL.

(Mingw setups that use emulated TLS doesn't have that drawback in itself. But GCC/binutils does occasionally have issues with non-static TLS variables accessed from multiple source files - such variables end up with a bunch of extra wrapper functions, which use weak linkage, which has a couple issues in GCC/binutils too; see D111779 where we avoided cross-translation-unit TLS variables in LLDB to avoid crashes when built with GCC.)

Is it possible to wrap the accesses to parallel::threadIndex into a wrapper function, i.e. like parallel::getThreadIndex()? I presume that would add a tiny bit of overhead, in a routine that we want to tune for performance anyway. We could have that wrapper be inline, in the case of non-Windows platforms (which should result in the same code generated, I guess) and be defined in Parallel.cpp for Windows cases. (There are build configurations on Windows where this wouldn't be strictly necessary, but the overhead is probably small enough that it's not worth the effort to try to distinguish all the individual cases.)

On z/OS this would also break the build because there is no support for TLS. To workaround this we have disabled LLVM_ENABLE_THREADS here https://github.com/llvm/llvm-project/blob/main/llvm/CMakeLists.txt#L478 . Would we be able to move the declaration inside #if LLVM_ENABLE_THREADS? Thanks in advance

hans added a subscriber: hans.Sep 21 2022, 8:18 AM

We're seeing non-deterministic build output after this change: https://bugs.chromium.org/p/chromium/issues/detail?id=1364380

hans added a comment.Sep 21 2022, 11:00 AM

We're seeing non-deterministic build output after this change: https://bugs.chromium.org/p/chromium/issues/detail?id=1364380

I've put an lld repro here: https://drive.google.com/file/d/19zRK4jUxghCA5Pg_OJUugLM-D4yZ7iQR/view?usp=sharing (1.4 GB, requires google.com login)

(The lack of thread_local problem on some configurations of Windows/zOS has been resolved.)

We're seeing non-deterministic build output after this change: https://bugs.chromium.org/p/chromium/issues/detail?id=1364380

I've put an lld repro here: https://drive.google.com/file/d/19zRK4jUxghCA5Pg_OJUugLM-D4yZ7iQR/view?usp=sharing (1.4 GB, requires google.com login)

Thanks for the reproduce. The nondeterminism is due to --pack-dyn-relocs=android.
I suspected whether it had trouble in a previous revision but after reading some code I thought it was ok.

I'll remove AndroidPackedRelocationSection does not support parallelism. . It works with deterministic parallelism.

So the section still has some problems. --pack-dyn-relocs=relr is deterministic from the many experiments I have done.

hans added a comment.Sep 27 2022, 6:19 AM

We're still seeing non-determinism after D133003. Did you verify that your change fixed the non-determinism in the repro tarball?

We're still seeing non-determinism after D133003. Did you verify that your change fixed the non-determinism in the repro tarball?

For the repro tarball, I've verified it's fixed.

while :; do fld.lld @response.txt --threads=4 -o 0; fld.lld @response.txt --threads=4 -o 1; cmp 0 1; done no output

hans added a comment.Sep 27 2022, 9:45 AM

We're still seeing non-determinism after D133003. Did you verify that your change fixed the non-determinism in the repro tarball?

For the repro tarball, I've verified it's fixed.

while :; do fld.lld @response.txt --threads=4 -o 0; fld.lld @response.txt --threads=4 -o 1; cmp 0 1; done no output

Okay, thanks. I'll see if I can provide some kind of reproducer for the new problem.

hans added a comment.Sep 28 2022, 5:24 AM

Sorry, turns out the bot which was failing hadn't picked up your change yet. I've verified that we're good locally, and also at tip-of-tree which includes the revert above.