Page MenuHomePhabricator

[LLD] Remove global state in lldCommon
ClosedPublic

Authored by aganea on Aug 27 2021, 2:50 PM.

Details

Summary

As discussed in our last Windows/COFF call, and as proposed by Reid in [1], this patch implements a first iteration to make LLD free of global initializers & global state. Here we only remove global state in lldCommon, however the same concept can be applied to other LLD libs.

The idea is to move all variables at file-scope or function-static-scope into a hosting structure (lld::CommonLinkingContext) that lives at lldMain()-scope. Drivers will inherit from this structure and add their own global state, in the same way as for the existing LinkingContext and MachOLinkingContext.

Before, code was directly referencing state in the .data section. It now goes through an indirection to a global **LinkingContext returned by lld::context(). In the future, we plan on supporting several instances of lld::**LinkingContext that would coexist in memory (when the hosting application is similar to D86351).

[1] https://lists.llvm.org/pipermail/llvm-dev/2021-June/151184.html

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

I think this should be staged into two patches:

  1. LLD-only: includes all of the Context changes, saver(), etc, but the context is still kept in a global pointer.
  2. LTO.h and Parallel.h: make the context pointer thread local, allowing the possibility of two linker instances in the same process, and add the task wrapper API

The goal is to make the patch that changes Support as small as possible. MLIR uses the parallel APIs, so we should ask them (@mehdi_amini, @rriddle, maybe others) to review the changes to llvm/Support/Parallel.h.

lld/tools/lld/lld.cpp
183

This probably needs to be initialized or static analysis tools may complain, even though the invalid case is handled above.

Nice work!

lld/Common/ErrorHandler.cpp
96–97

I'm curious about this change: you're protecting what seems to be global methods with a non-global mutex. The code comment for the mutex says `

// The functions defined in this file can be called from multiple threads,
// but lld::outs() or lld::errs() are not thread-safe. We protect them using a
// mutex.
This made sense when the mutex was global, less so now. Can you clarify this? (and update the comment)

The goal is to make the patch that changes Support as small as possible. MLIR uses the parallel APIs, so we should ask them (@mehdi_amini, @rriddle, maybe others) to review the changes to llvm/Support/Parallel.h.

Actually MLIR does not use Parallel.h anymore, especially to avoid relying on any global. We're using instead a ThreadPool owned by the MLIRContext, the client applications can inject a thread pool in the context to manage sharing threads how they see fit.

Our wrappers are in https://github.com/llvm/llvm-project/blob/main/mlir/include/mlir/IR/Threading.h and operate around llvm/Support/ThreadPool.h.

The mechanism introduced in Parallel.h in this revision (wrapThreadState) seems mostly like some sort of "workaround" to be able to continue to rely on some global way to pass state around in lld instead of having something more explicit, which we favored in MLIR (where like in LLVM, the Context is fairly centric and not accessed through a global).

llvm/lib/Support/Parallel.cpp
87 ↗(On Diff #370678)

Seems like worth documenting.

aganea updated this revision to Diff 373349.Sep 17 2021, 3:04 PM
aganea marked 2 inline comments as done.
aganea edited the summary of this revision. (Show Details)

The mechanism introduced in Parallel.h in this revision (wrapThreadState) seems mostly like some sort of "workaround" to be able to continue to rely on some global way to pass state around in lld instead of having something more explicit, which we favored in MLIR (where like in LLVM, the Context is fairly centric and not accessed through a global).

It was totally a temporary workaround, until we explicitly pass contexts around. I've tried passing contexts around, but it is a big change and I'm not sure it's worth it. Removed this code for now.

I think this should be staged into two patches:

  1. LLD-only: includes all of the Context changes, saver(), etc, but the context is still kept in a global pointer.

Updated the patch to only do 1. like you suggested.

lld/Common/ErrorHandler.cpp
96–97

Before, the mutex and the functions were global. They both have the same lifetime as the call to lldMain(). I was planning on supporting several concurrent calls to lldMain() in the future. Updated the comment.

llvm/lib/Support/Parallel.cpp
87 ↗(On Diff #370678)

Removed.

rnk added a comment.Sep 17 2021, 3:41 PM

Actually MLIR does not use Parallel.h anymore, especially to avoid relying on any global. We're using instead a ThreadPool owned by the MLIRContext, the client applications can inject a thread pool in the context to manage sharing threads how they see fit.

Our wrappers are in https://github.com/llvm/llvm-project/blob/main/mlir/include/mlir/IR/Threading.h and operate around llvm/Support/ThreadPool.h.

The mechanism introduced in Parallel.h in this revision (wrapThreadState) seems mostly like some sort of "workaround" to be able to continue to rely on some global way to pass state around in lld instead of having something more explicit, which we favored in MLIR (where like in LLVM, the Context is fairly centric and not accessed through a global).

I see, I guess that is new as of June, so news to me. :)

I think there are really two key operations done throughout the linker that need the context:

  • memory allocation (via BumpPtrAllocator)
  • diagnostics (warnings & errors)

Conventional C++ programs use a global heap because passing around a pool or arena is too much work.

I think I would like to see central LLD context classes, but I'm open to using thread locals on the way to that restructuring.


As far as this iteration of the patch goes, I think it looks good, but I think other linker port owners should sign off on it too.

aganea updated this revision to Diff 373420.Sep 18 2021, 8:55 AM

Fix linter suggestions.

Ping.

lld/Common/ErrorHandler.cpp
96–97

I meant now, they both have the same lifetime as the call to lldMain().

Sorry for the slow review from the Mach-O end.

This seems like a pretty clean solution overall. I was concerned that removing global state would mean we have to thread a ton of context through our entire call stack, but switching to function calls instead is easy enough.

I'll measure the performance of this change this week; we care heavily about linking speed because it dominates incremental build speeds (so fast links can be a huge developer experience boost). I'm not expecting any change from the global variable access changes (especially with LTO), but I'm curious if switching the type-specific bump allocators to using a DenseMap for lookup will have any measurable effect.

I see a bunch of discussions about thread-locals; was that for an earlier version of this patch, or am I just missing something? (I don't see anything thread-local related in the current version of the patch.)

To make sure I'm understanding this correctly, can the calls to freeArena be removed because the destructor for the context will take care of that now?

I'd like @int3 and/or @thakis to take a look on the Mach-O side as well (since they're far more active), and I imagine @MaskRay is the best person for the ELF side.

lld/COFF/InputFiles.cpp
1037

What was the motivation for saving to a local variable here? Is it the use of saver in the loop below? (It makes sense to me; I just wanted to confirm my understanding.)

lld/Common/CommonLinkingContext.cpp
22 ↗(On Diff #373420)

It'd be nice if there was a way to put this in the header, so that it could be inlined even in non-LTO builds, but that'd prevent lctx from being file-static, which seems undesirable. I imagine LTO or ThinLTO would be happy to inline this (and I'll confirm when I do performance measurements), so it shouldn't matter too much in the end.

23 ↗(On Diff #373420)

I'm a bit concerned about the potential performance implications of this assert, but I'll measure first.

lld/Common/Memory.cpp
14–24

Can these be put in the header, so that they can be inlined even in non-LTO builds?

lld/include/lld/Common/Memory.h
26–28

This comment should probably be updated? The tracking is happening in the context now.

43–49

This comment needs to be updated (or perhaps just removed).

44

The contents don't matter at all, just the address (which is used as a key for the instances map inside the context), right? Might be helpful to have a comment to that effect, otherwise the introduction of what seems like new global state is confusing.

aganea updated this revision to Diff 396348.Dec 27 2021, 3:33 PM
aganea marked 7 inline comments as done.

Rebase.
Integrate with @akhuang's COFFLinkerContext.
Address comments from @smeenai.

Thanks for taking a look!

I see a bunch of discussions about thread-locals; was that for an earlier version of this patch, or am I just missing something? (I don't see anything thread-local related in the current version of the patch.)

Yes, please take a look at "Diff 4" and before, lctx used to be a THREAD_LOCAL to accommodate for several instances of LLD running in the same process. But @rnk suggested to go in steps, and first move all scattered statics into a global static context.

To make sure I'm understanding this correctly, can the calls to freeArena be removed because the destructor for the context will take care of that now?

Yes.

lld/COFF/InputFiles.cpp
1037

Yes, precisely, to clarify the intention that saver shouldn't change in that loop.

lld/Common/CommonLinkingContext.cpp
22 ↗(On Diff #373420)

Did you have the chance to run any tests?

23 ↗(On Diff #373420)

Since this would only trigger in LLVM_ENABLE_ASSERTIONS=ON builds, the performance shouldn't matter that much? Or should it? Otherwise we could only have it in LLVM_ENABLE_EXPENSIVE_CHECKS builds?

Will llvm::DenseMap<void *, SpecificAllocBase *> instances; make make<XXX> too slow?

For ELF, I have measured that symbol/section initialization time matters. As an alternative, lld/ELF can do bulk initialization instead.
If an ELF .o file has N sections, 1/3 are discarded sections (SHT_RELA/SHT_SYMTAB/etc) while 1/7 are discarded COMDAT sections.
When linking chrome, perhaps 6/7 are discarded COMDAT sections (too many templates?). It wastes some memory if we don't compute the number of needed sections, but that may be fine.

Will llvm::DenseMap<void *, SpecificAllocBase *> instances; make make<XXX> too slow?

I measured a previous version of this patch on Windows with COFF ​and a Chromium-sized output, and I only saw a very tiny departure (the fluctuations could be attributed to mmap'ing). Please see: https://reviews.llvm.org/D108850#2975472
Would you be able to try this patch on Linux linking Chromium?

MaskRay accepted this revision as: MaskRay.EditedDec 28 2021, 2:57 PM

The patch makes my Linux x86-64 lld (-DCMAKE_BUILD_TYPE=Release -DLLVM_TARGETS_TO_BUILD=X86 -DLLVM_ENABLE_PIC=off -DCMAKE_CXX_FLAGS='-fno-asynchronous-unwind-tables') 130+KiB smaller.

I tried linking chrome and clang release/relwithdebinfo with this patch. No significant difference. Here is chrome's --threads=8 output

% hyperfine --warmup 2 --min-runs 20 "numactl -C 20-27 "{/tmp/c/0,/tmp/c/1}" -flavor gnu @response.txt --threads=8 -o lld.8"
Benchmark 1: numactl -C 20-27 /tmp/c/0 -flavor gnu @response.txt --threads=8 -o lld.8
  Time (mean ± σ):      6.699 s ±  0.032 s    [User: 8.815 s, System: 2.439 s]
  Range (min … max):    6.637 s …  6.745 s    20 runs
 
Benchmark 2: numactl -C 20-27 /tmp/c/1 -flavor gnu @response.txt --threads=8 -o lld.8
  Time (mean ± σ):      6.691 s ±  0.027 s    [User: 8.856 s, System: 2.396 s]
  Range (min … max):    6.647 s …  6.743 s    20 runs
 
Summary
  'numactl -C 20-27 /tmp/c/1 -flavor gnu @response.txt --threads=8 -o lld.8' ran
    1.00 ± 0.01 times faster than 'numactl -C 20-27 /tmp/c/0 -flavor gnu @response.txt --threads=8 -o lld.8'

Checked a large internal program, no difference.

Hope that https://maskray.me/blog/2021-12-19-why-isnt-ld.lld-faster#parallel-initialization-of-sections and parallel initialization of symbol tables are not harder to refactor :)

This revision is now accepted and ready to land.Dec 28 2021, 2:57 PM
MaskRay requested changes to this revision.Dec 28 2021, 3:08 PM

-DLLVM_USE_SANITIZER=Address has some leaks.

This revision now requires changes to proceed.Dec 28 2021, 3:08 PM
aganea updated this revision to Diff 396735.Dec 30 2021, 5:07 PM

Fix memory leaks.

MaskRay accepted this revision as: MaskRay.Dec 30 2021, 8:07 PM

LGTM.

lld/tools/lld/lld.cpp
165
178

If an error occurs, link() return false.

After r becomes int, return may be ambitious on whether it applies to the function or to the r variable.

179

You can make r an int and avoid repeated ? 1 : 0 below.

181

Call _Exit to skip destructors if exitEarly is true.


The original comment was

// Exit immediately if we don't need to return to the caller.
// This saves time because the overhead of calling destructors
// for all globally-allocated objects is not negligible.

Is it worth copying some words to the new comment?

MaskRay accepted this revision.Dec 30 2021, 8:08 PM
This revision is now accepted and ready to land.Dec 30 2021, 8:08 PM
aganea updated this revision to Diff 396816.Dec 31 2021, 11:41 AM
aganea marked 4 inline comments as done and an inline comment as not done.

As suggested by @MaskRay

MaskRay added inline comments.Jan 5 2022, 7:41 PM
lld/tools/lld/lld.cpp
139

static

MaskRay added inline comments.Jan 5 2022, 7:42 PM
lld/COFF/Driver.cpp
65

Just raw_ostream

aganea updated this revision to Diff 397974.Jan 6 2022, 1:14 PM
aganea marked 3 inline comments as done.

As suggested by @MaskRay

aganea added a comment.Jan 6 2022, 1:17 PM

@smeenai @int3 @thakis Any further comments? Can I go ahead with this?

I wasn't able to get the perf testing done at the time, and I won't have access to one of my testing machines for the next few days, but I'm not anticipating anything dramatic. If @int3 / @thevinster / @oontvoo / @thakis can run tests on their end, that's great, but this looks good to me regardless.

I wasn't able to get the perf testing done at the time, and I won't have access to one of my testing machines for the next few days, but I'm not anticipating anything dramatic. If @int3 / @thevinster / @oontvoo / @thakis can run tests on their end, that's great, but this looks good to me regardless.

Thanks for the headsup. Yeah, I can run a few bms - will share the results in a couple of hours

oontvoo accepted this revision.Jan 7 2022, 11:07 AM

perf-impact seems minimal to me. LGTM!

x base.txt
+ with_patch.txt

SYSTEM CPU time: 
    N           Min           Max        Median           Avg        Stddev
x  10          0.99          1.01          1.01         1.003   0.008232726
+  10          0.98          1.04          1.02         1.014   0.022705848
No difference proven at 95.0% confidence

USER CPU time: 
    N           Min           Max        Median           Avg        Stddev
x  10          3.56          3.61          3.57         3.575   0.018408935
+  10          3.56          3.71          3.62         3.625   0.053176854
Difference at 95.0% confidence
	0.05 +/- 0.0373876
	1.3986% +/- 1.04581%
	(Student's t, pooled s = 0.0397911)

WALL time: 
    N           Min           Max        Median           Avg        Stddev
x  10          4.06          4.12          4.08         4.085   0.024152295
+  10          4.04          4.23          4.14          4.14   0.070867639
Difference at 95.0% confidence
	0.055 +/- 0.0497434
	1.34639% +/- 1.21771%
	(Student's t, pooled s = 0.0529413)

Thanks for testing @oontvoo!

This revision was automatically updated to reflect the committed changes.

`@aganea This change broke SerializeToHsaco in MLIR, can you help resolve this issue or revert? See https://github.com/llvm/llvm-project/blob/main/mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp#L440

I can't quite tell what the new boolean arguments mean.

`@aganea This change broke SerializeToHsaco in MLIR, can you help resolve this issue or revert? See https://github.com/llvm/llvm-project/blob/main/mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp#L440

I can't quite tell what the new boolean arguments mean.

Fixing now.

yaxunl added a subscriber: yaxunl.Feb 7 2022, 9:21 PM

This patch caused regressions for HIP on Windows. Basically, any HIP programs using device variables will fail to link on Windows. The issue does not happen on Linux.

To reproduce,

$ cat a-hip-amdgcn-amd-amdhsa-gfx1031.s
        .text
        .amdgcn_target "amdgcn-amd-amdhsa--gfx1031"
        .p2alignl 6, 3214868480
        .fill 48, 4, 3214868480
        .protected      xxx                     ; @xxx
        .type   xxx,@object
        .data
        .globl  xxx
        .p2align        2
xxx:
        .long   123                             ; 0x7b
        .size   xxx, 4

        .ident  "clang version 14.0.0 (git@github.amd.com:Compute-Mirrors/llvm-project.git ce53ba9aaee8d053604883d5befaf07bad60ba84)"
        .section        ".note.GNU-stack"
        .addrsig
        .amdgpu_metadata
---
amdhsa.kernels:  []
amdhsa.target:   amdgcn-amd-amdhsa--gfx1031
amdhsa.version:
  - 1
  - 1
...

        .end_amdgpu_metadata

$  "C:\\hip\\bin\\clang++.exe" -cc1as -triple amdgcn-amd-amdhsa -filetype obj -target-cpu gfx1031 -mrelocation-model pic -mincremental-linker-compatible --mrelax-relocations -o a-hip-amdgcn-amd-amdhsa-gfx1031.o a-hip-amdgcn-amd-amdhsa-gfx1031.s

$ "C:\\hip\\bin\\lld.exe" -flavor gnu -shared a-hip-amdgcn-amd-amdhsa-gfx1031.o
lld: error: duplicate symbol: xxx
>>> defined at a-hip-amdgcn-amd-amdhsa-gfx1031.o:(xxx)
>>> defined at a-hip-amdgcn-amd-amdhsa-gfx1031.o:(.data+0x0)
aganea added a comment.Feb 8 2022, 8:55 AM

@yaxunl I have a fix, I will send a patch later today.