This is an archive of the discontinued LLVM Phabricator instance.

[lld][common][lld-macho][lld-coff] Support per-thread allocators and StringSavers
Needs Revision · Public

Authored by oontvoo on Apr 1 2022, 10:52 AM.

Details

Reviewers
int3
MaskRay
mstorsjo
rnk
aganea
jyknight
Group Reviewers
Restricted Project
Summary

Details:

Provide thread-safe StringSaver and bAlloc by assigning each thread its own StringSaver and allocator.
Usage: code that might be run in different threads should call the new perThreadSaver() or makePerThread()
instead of the current util (saver() and make())

Users can enable this unconditionally at compile time by setting -DTHREAD_SAFE_MEMORY.

Some considerations:

  • This might have some perf/memory impact.
  • TLS support might not be available on all archs. (Darwin?)

Benchmarks done on linking envoy and chromium for Darwin showed no performance difference when using the new allocators.

Diff Detail

Event Timeline

oontvoo created this revision.Apr 1 2022, 10:52 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 1 2022, 10:52 AM
Herald added a subscriber: StephenFan. · View Herald Transcript
oontvoo requested review of this revision.Apr 1 2022, 10:52 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 1 2022, 10:52 AM
int3 added inline comments.Apr 1 2022, 3:32 PM
lld/Common/Memory.cpp
51

I was hoping we could avoid mutexes altogether. I was thinking of something like

// contextCount and perThreadContexts are shared; threadTag is assumed to be a
// thread_local integer where 0 means "no tag assigned yet" (so contextCount
// would have to start at 1 to avoid colliding with that sentinel).
std::atomic<size_t> contextCount;
std::array<PerThreadContext *, MAX_THREADS> perThreadContexts;

if (threadTag == 0) {
  threadTag = contextCount++;
  perThreadContexts[threadTag] = new PerThreadContext;
}
MaskRay added inline comments.Apr 1 2022, 9:51 PM
lld/include/lld/Common/CommonLinkerContext.h
24

Delete

unordered_map is very inefficient.

lld/include/lld/Common/Memory.h
98

PerThread may be better than ThreadSafe.

oontvoo updated this revision to Diff 420179.Apr 4 2022, 7:28 AM
oontvoo marked 2 inline comments as done.

removed unused includes and renamed make() func

oontvoo added inline comments.Apr 4 2022, 10:29 AM
lld/Common/Memory.cpp
51

What is the value of MAX_THREADS? Doesn't that mean "max number of threads that can run concurrently" and not "max number of threads ever created during the application runtime"?

As new threads are spawned, the array would need to expand, and that'd be a potential race condition, yes?

lld/include/lld/Common/CommonLinkerContext.h
24

(removed - it was never used...)

int3 added inline comments.Apr 4 2022, 11:13 AM
lld/Common/Memory.cpp
51

No, I meant "max number of threads ever created" :)

Since we use threadpools, we shouldn't be creating that many threads over the application lifetime anyway.

But you got me thinking about whether there's a nicer way to implement this that doesn't involve the user having to calculate that number up front. We could have a thread-local pointer to the per-thread context, so that checking if the context has been created is a simple null check (as opposed to the current hashmap lookup). We only take a lock the first time a thread executes and needs to create a new PerThreadContext, in order to safely add it to a global vector.

To make it 100% lock-free, we could implement our own append-only dynamically expanding series of arrays, much like how the BumpPtrAllocator uses its Slabs. But that's probably overkill :)
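
For reference, a minimal sketch of the "thread-local pointer + lock only on first use" idea described above (names like PerThreadContext and getThreadContext are placeholders, not code from this patch):

#include <mutex>
#include <vector>

struct PerThreadContext { /* BumpPtrAllocator, StringSaver, ... */ };

static std::vector<PerThreadContext *> allContexts; // guarded by contextMutex
static std::mutex contextMutex;
static thread_local PerThreadContext *threadContext = nullptr;

PerThreadContext &getThreadContext() {
  if (!threadContext) {                   // cheap null check on the hot path
    threadContext = new PerThreadContext; // first use on this thread
    std::lock_guard<std::mutex> lock(contextMutex);
    allContexts.push_back(threadContext); // registered so the linker context
  }                                       // can tear it down later
  return *threadContext;
}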

oontvoo added inline comments.Apr 4 2022, 1:40 PM
lld/Common/Memory.cpp
51

ok, gotcha! benchmarking showed no difference between the previous and the last approach (https://reviews.llvm.org/D123075)

Which one do you prefer?
I guess I liked this patch (updated diff) a bit better

oontvoo updated this revision to Diff 420617.Apr 5 2022, 1:21 PM

use thread_local, which is more portable than __thread

oontvoo edited the summary of this revision. (Show Details)Apr 5 2022, 1:22 PM
oontvoo updated this revision to Diff 420881.Apr 6 2022, 8:25 AM

fixed windows failures

oontvoo edited the summary of this revision. (Show Details)Apr 6 2022, 8:26 AM
int3 added inline comments.Apr 6 2022, 12:24 PM
lld/Common/Memory.cpp
51

constexpr uint32_t MAX_THREADS = std::numeric_limits<int32_t>::max() - 1;

doesn't that mean our std::array is now taking up like 32 MB 🤔 I was thinking of using a much lower number...

but IMO the vector solution I suggested above is cleaner. we would need a lock when pushing onto the vector, but we'll only need to do it once per thread, and we wouldn't need to pre-allocate a whole bunch of memory

benchmarking showed no difference between the previous and the last approach

good to know! do we have any regression vs the non-thread-safe version?

You cannot query the maximum parallelism on demand and then create the array?

int3 added a comment.Apr 6 2022, 12:32 PM

You cannot query the maximum parallelism on demand and then create the array?

It's a little fragile -- things would break if someone created more than one ThreadPool instance & gave each instance the max hardware parallelism

You cannot query the maximum parallelism on demand and then create the array?

It's a little fragile -- things would break if someone created more than one ThreadPool instance & gave each instance the max hardware parallelism

But then MAX_THREADS has the same problem? For two thread pools I would need 2 * MAX_THREADS ?

int3 added a comment.Apr 6 2022, 12:36 PM

right that's why I'm suggesting the use of a vector

oontvoo added inline comments.Apr 6 2022, 12:37 PM
lld/Common/Memory.cpp
51

but IMO the vector solution I suggested above is cleaner. we would need a lock when pushing onto the vector, but we'll only need to do it once per thread, and we wouldn't need to pre-allocate a whole bunch of memory

fair enough - updated the diff to that approach.

do we have any regression vs the non-thread-safe version?

also no difference (tested by unconditionally defining THREAD_SAFE_MEMORY to 1)

Question: Do we want to just enable it now? (Users who don't want it can turn it off or call the old functions, now renamed with an "unsafe" suffix.)

oontvoo updated this revision to Diff 420975.Apr 6 2022, 12:37 PM

updated diff

oontvoo updated this revision to Diff 420978.Apr 6 2022, 12:43 PM

updated diff

int3 added inline comments.Apr 6 2022, 12:52 PM
lld/Common/Memory.cpp
51

also no difference (tested by unconditionally defining THREAD_SAFE_MEMORY to 1)

nice!

Do we want to just enable it now? (Users who don't want it can turn it off or call the old functions, now renamed with an "unsafe" suffix.)

sgtm. I would actually argue that we don't even need to keep the "unsafe" versions around. Folks can add them back if they ever have a use case for it (I can't imagine one atm). But @MaskRay might want to chime in. Also, can we add someone who works on the COFF backend of LLD as a reviewer?

lld/include/lld/Common/CommonLinkerContext.h
25–26

can rm

35–36

can rm

42–43

nit: IMO it's not necessary to have perThread in these names, since they're already part of a class called PerThreadContext :)

69–70

not sure what this comment is about. what's the concern here?

oontvoo updated this revision to Diff 420991.Apr 6 2022, 1:11 PM
oontvoo marked 4 inline comments as done.

addressed review comments

lld/include/lld/Common/CommonLinkerContext.h
69–70

that was from the previous impl (with the mapping). removed now

oontvoo retitled this revision from [lld][common][lld-macho] Support per-thread allocators and StringSavers to [lld][common][lld-macho][lld-coff] Support per-thread allocators and StringSavers.Apr 6 2022, 1:12 PM
oontvoo added a reviewer: Restricted Project.
oontvoo added a subscriber: rnk.
oontvoo added inline comments.
lld/Common/Memory.cpp
51

Do you know who should be added for COFF? (haven't followed closely .. maybe @rnk ?)

tschuett added inline comments.Apr 6 2022, 1:15 PM
lld/Common/Memory.cpp
55

Would an assert on perThreadContexts.size() < FOO be helpful?

oontvoo added inline comments.Apr 6 2022, 1:20 PM
lld/Common/Memory.cpp
55

The .size() is the number of threads ever created by this process (note: it is NOT the number of threads that run concurrently). As such, I'm not sure we care what the limit is (other than that if it runs out of memory it'll crash, but then we'd get a stack trace and that's useful enough).
Can you clarify why we should cap this?

My bad. I thought of it as a last resort before the system crashes.

smeenai added inline comments.
lld/Common/Memory.cpp
51

@aganea and @mstorsjo are also good contacts for the COFF side.

oontvoo edited the summary of this revision. (Show Details)Apr 6 2022, 1:31 PM
oontvoo added reviewers: mstorsjo, rnk, aganea.
aganea added a subscriber: lattner.Apr 7 2022, 6:48 AM

Hello! Thanks for adding me :-) Interesting challenge!
It's a bit sad that we have to do all this high-level/application-level memory management. Most modern allocators already support (lock-free) per-thread memory pools out-of-the-box, along with migration between threads and to the global pool/arena. This patch seems to be needed solely because we use BumpPtrAllocator/SpecificBumpPtrAllocator. I wonder how things would perform with just using malloc() instead + a modern underlying allocator (rpmalloc, mimalloc, ...). Memory locality brought by the bumpalloc is important, but it'd be interesting to compare benchmarks. FWIW there were discussions with @lattner at some point about integrating rpmalloc into the LLVM codebase, but I never got to post a RFC.

lld/Common/Memory.cpp
21

As things stand, you could also do LLVM_THREAD_LOCAL PerThreadContext *threadContext = nullptr;
However, since this is really just like a static, it goes against the "*LinkerContext" concept, and since you already added some functionality in the context, you could just as well move this into CommonLinkerContext. You could do something like ThreadLocal<AllocContext> (see llvm/include/llvm/Support/ThreadLocal.h). That would allocate a dynamic TLS slot on the first use, as opposed to thread_local, which points to a pre-allocated slot in the PE static TLS table.

lld/include/lld/Common/CommonLinkerContext.h
37

I would rather name it considering what the struct does, not how it's used? AllocationContext maybe?

53

Why not use your new struct here, even when LLD_THREAD_SAFE_MEMORY == 0?

91

I agree that those "unsafe" functions might not be needed. The same applies to the two "perThread*" functions above. Can we just go through saver() and bAlloc()? Do we really want application code to explicitly choose either per-thread allocation pool or the global allocation pool?

104

LLD_THREAD_SAFE_MEMORY perhaps? Is this meant to be configured through cmake?

lld/include/lld/Common/Memory.h
74

Is there a need for "unsafe" or "perthread*"?

Hello! Thanks for adding me :-) Interesting challenge!
It's a bit sad that we have to do all this high-level/application-level memory management. Most modern allocators already support (lock-free) per-thread memory pools out-of-the-box, along with migration between threads and to the global pool/arena. This patch seems to be needed solely because we use BumpPtrAllocator/SpecificBumpPtrAllocator. I wonder how things would perform with just using malloc() instead + a modern underlying allocator (rpmalloc, mimalloc, ...). Memory locality brought by the bumpalloc is important, but it'd be interesting to compare benchmarks. FWIW there were discussions with @lattner at some point about integrating rpmalloc into the LLVM codebase, but I never got to post a RFC.

We did experiment with using a modern/thread-safe allocator here, specifically tcmalloc internally at Google, and found that it was pretty good (i.e., no slower than the bump-ptr alloc). But conceptually the bump allocator could be faster in simple cases, so we thought it wouldn't hurt to use both when possible (i.e., have a thread-safe bump allocator and a fast system malloc).

oontvoo added inline comments.Apr 7 2022, 8:04 AM
lld/Common/Memory.cpp
21

thanks! I've moved the variable to the LinkerContext class but wasn't able to use LLVM's ThreadLocal because I needed to hold on to the variable's address (to reset its value from another thread).

Would this "just work" with ThreadLocal?

i.e.:

ThreadLocal<AllocationContext> currentThreadCtxt;

std::vector<ThreadLocal<AllocationContext> *> allThreadContexts;

// collecting all the thread-local contexts' addresses
allThreadContexts.push_back(&currentThreadCtxt);

// in destroy(), reset them to a sentinel value (nullptr)
for (auto *localCtxt : allThreadContexts) {
  delete localCtxt->get();
  localCtxt->set(nullptr);
}
lld/include/lld/Common/Memory.h
74

I didn't want to change all the ports to the thread-safe version because I'm not familiar with all of them. (and was only able to do benchmarking for the macho port).
If you all think it's "safe" to do this, then yeah, it'd simplify this patch a bit! :)

aganea added inline comments.Apr 7 2022, 10:03 AM
lld/Common/CommonLinkerContext.cpp
38

It is best if we had llvm::sys::ScopedWriter lock(contextMutex); here too.

lld/Common/Memory.cpp
21

It is not possible to take the address of the TLS slot -- one can only .get() or .set() the pointer in the ThreadLocal object. We needed that address to reset to sentinel in the destructor. But if moving ThreadLocal<AllocationContext> into CommonLinkerContext, we don't need to reset to a sentinel value anymore, since the TLS slot in all threads will die with CommonLinkerContext. If the TLS slot is reused later, it'll be reset to 0 by the system. So we could just use std::vector<AllocationContext *> allThreadContexts (plain pointer).

struct CommonLinkerContext {
  ThreadLocal<AllocationContext> threadContext;
  std::vector<AllocationContext *> allThreadContexts; // guarded by contextMutex
  ...
};

AllocationContext *CommonLinkerContext::allocCtxt() {
  if (!threadContext.get()) {
    // Context didn't exist yet for this thread, so create a new one.
    auto *context = new AllocationContext;
    threadContext.set(context);

    llvm::sys::ScopedWriter lock(contextMutex);
    allThreadContexts.push_back(context);
  }
  return threadContext.get();
}

CommonLinkerContext::~CommonLinkerContext() {
  ...
  llvm::sys::ScopedWriter lock(contextMutex);
  for (AllocationContext *context : allThreadContexts) {
    for (auto &instance : context->instances)
      instance.second->~SpecificAllocBase();
    delete context;
  }
  ...
}
aganea added inline comments.Apr 7 2022, 10:12 AM
lld/Common/Memory.cpp
21

There's still a chance that an AllocationContext is in use when calling ~CommonLinkerContext(), if all threads were not .joined prior. But that should probably be solved at a higher level?

int3 added a comment.Apr 7 2022, 10:55 AM

We did experiment with using a modern/thread-safe allocator here, specifically tcmalloc internally at Google, and found that it was pretty good (i.e., no slower than the bump-ptr alloc). But conceptually the bump allocator could be faster in simple cases, so we thought it wouldn't hurt to use both when possible (i.e., have a thread-safe bump allocator and a fast system malloc).

IIRC you found that tcmalloc-for-everything was at parity with bump ptr alloc + system allocator, but bump ptr allocator + tcmalloc together was still faster than tcmalloc-for-everything, right?

oontvoo updated this revision to Diff 421308.Apr 7 2022, 12:09 PM
oontvoo marked 3 inline comments as done.

Addressed review comments:

  • Use llvm::sys::ThreadLocal instead of native thread_local (potential concern: the pthread_*() functions it uses might be a bit slower?)
  • Rename macro to LLD_THREAD_SAFE_MEMORY
oontvoo added inline comments.Apr 7 2022, 1:02 PM
lld/Common/Memory.cpp
21

There's still a chance that an AllocationContext is in use when calling ~CommonLinkerContext(), if all threads were not .joined prior. But that should probably be solved at a higher level?

Yes, it's already a problem now, with the objects returned by saver() and bAlloc() being used after the context has been destroyed.
(We can rely on asan/msan for catching this, no?)

lld/include/lld/Common/CommonLinkerContext.h
91

@MaskRay: any concern with removing this option? IIRC, you wanted this option for performance reasons.

lattner removed a subscriber: lattner.Apr 7 2022, 1:44 PM

We did experiment with using a modern/thread-safe allocator here, specifically tcmalloc internally at Google, and found that it was pretty good (i.e., no slower than the bump-ptr alloc). But conceptually the bump allocator could be faster in simple cases, so we thought it wouldn't hurt to use both when possible (i.e., have a thread-safe bump allocator and a fast system malloc).

IIRC you found that tcmalloc-for-everything was at parity with bump ptr alloc + system allocator, but bump ptr allocator + tcmalloc together was still faster than tcmalloc-for-everything, right?

Right - that's correct. :) Thanks for clarifying that. (Except with the major caveat that tcmalloc doesn't officially support Darwin.)

The goal of this patch, though, isn't necessarily to speed up the allocators - it's simply to make them thread-safe without making them slower.

int3 added a comment.Apr 11 2022, 12:24 PM

Looks like the other comments have been addressed. I will stamp it at EOD today if no one objects

lld/include/lld/Common/CommonLinkerContext.h
91

let's remove it for now, folks can always add it back if they want to use it

let's drop the "perThread" prefixes too

oontvoo updated this revision to Diff 422020.Apr 11 2022, 1:13 PM
oontvoo marked an inline comment as done.

removed the unneeded helpers and defined the CMake option for LLD_THREAD_SAFE_MEMORY

int3 accepted this revision.Apr 11 2022, 5:54 PM

lgtm. Let's update the commit message, in particular

Usage: code that might be run in different threads should call the new perThreadSaver() or makePerThread()

instead of the current util (saver() and make())

also

TLS support might not be available on all archs. (Darwin?)

Is this still an ongoing concern? (Why do you think TLS support might not be available on Darwin?)

lld/CMakeLists.txt
228 ↗(On Diff #422020)

I'm generally not a fan of adding compile-time options as I think it increases the number of potentially poorly-tested code paths. Maybe we could just use LLD_ENABLE_THREADS to gate the code currently protected by LLD_THREAD_SAFE_MEMORY.

But I guess having things behind a compile-time flag for now will make this diff easier to land. It's conceivable that some buildbots may be unhappy with this depending on e.g. the target-specific support for TLS. Can we follow up with a diff that makes the thread-safe behavior the default? If that passes all the buildbots, I think we should make it the default behavior.

This revision is now accepted and ready to land.Apr 11 2022, 5:54 PM
MaskRay added inline comments.Apr 11 2022, 8:30 PM
lld/Common/Memory.cpp
21

We can use LLVM_THREAD_LOCAL instead of llvm/Support/ThreadLocal.h. llvm/Support/ThreadLocal.h uses pthread_getspecific which is slower than TLS.

LLVM_THREAD_LOCAL has been used in llvm/lib/Support and clang.

MaskRay added a comment.EditedApr 11 2022, 9:04 PM

In the LLD_THREAD_SAFE_MEMORY code path, make<Foo>(...) now has significantly larger overhead due to the pthread_getspecific call (via llvm/Support/ThreadLocal.h).
A large number of make<...> instantiations in lld do not benefit from per-thread allocation.
So my thought (I mentioned this somewhere) is that we introduce a new make utility, let the few that actually benefit use it, instead of hijacking all make<...> to be pure per-thread.

If we trace the logic in a debugger, there is a surge of abstraction costs. A make call needs:

  • call CommonLinkerContext.cpp lld::commonContext() to get the global variable lctx (there is a TODO that it may be thread_local)
  • in CommonLinkerContext::perThreadAllocContext(), call pthread_getspecific to get the thread-specific data key
  • in lld::SpecificAllocBase::getOrCreatePerThread, threadContext->instances[tag] retrieves the instance from a DenseMap

I'd hope we step back and think whether the overhead can be reduced.

MaskRay added a subscriber: lattner.EditedApr 11 2022, 9:19 PM

Hello! Thanks for adding me :-) Interesting challenge!

It's a bit sad that we have to do all this high-level/application-level memory management. Most modern allocators already support (lock-free) per-thread memory pools out-of-the-box, along with migration between threads and to the global pool/arena. This patch seems to be needed solely because we use BumpPtrAllocator/SpecificBumpPtrAllocator. I wonder how things would perform with just using malloc() instead + a modern underlying allocator (rpmalloc, mimalloc, ...). Memory locality brought by the bumpalloc is important, but it'd be interesting to compare benchmarks. FWIW there were discussions with @lattner at some point about integrating rpmalloc into the LLVM codebase, but I never got to post a RFC.

I think the most important reason that we need llvm::SpecificBumpPtrAllocator<T> is for its destructors.
In lld, we allocate many objects with non-trivial destructors. make<T>(...) allows us to be lazy and not think of the destructor.
I have actually replaced many make<T>(...) singleton usage in lld/ELF with std::make_unique<T>(...) to save code size.

The second benefit is application-level memory management is more efficient when prudently used.
Whatever lock-free scheme is used, the system malloc will have higher overhead than a use case where objects are rarely used (bump allocator).
Very few classes in lld actually need this (e.g. Symbol, Section, possibly InputFile), but we abused make<T>(...) for almost everything because of its ease of use.
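
As a small illustration of that first point, here's a sketch using llvm::SpecificBumpPtrAllocator directly (generic example code, not part of this patch):

#include "llvm/Support/Allocator.h"
#include <new>
#include <string>
#include <utility>

struct Sym {
  std::string name; // non-trivial destructor
  Sym(std::string n) : name(std::move(n)) {}
};

void demo() {
  llvm::SpecificBumpPtrAllocator<Sym> alloc;
  // Placement-new into bump-allocated storage, much like lld's make<T>(...).
  Sym *s = new (alloc.Allocate()) Sym("foo");
  (void)s;
  // When `alloc` is destroyed it runs ~Sym() on every object it handed out;
  // a plain BumpPtrAllocator would only release the raw memory.
}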

oontvoo added inline comments.Apr 12 2022, 6:27 AM
lld/Common/Memory.cpp
21

yeah, I've noted that pthread_getspecific could be a bit problematic but in practice I didn't see any performance impact when running the benchmarks.
The advantage of using the ThreadLocal class is that we could put the "tag" inside the context class and that's better encapsulation.
WDYT?

oontvoo added inline comments.Apr 12 2022, 6:35 AM
lld/CMakeLists.txt
228 ↗(On Diff #422020)

(sorry - missed this comment previously)

Good idea on using LLD_ENABLE_THREADS! The reason I added this additional variable was that I wanted to give other ports an option to opt out of this if they choose to, for whatever reason.
So if no one objects to enabling this unconditionally (or via the LLD_ENABLE_THREADS), then happy to remove this.

In the LLD_THREAD_SAFE_MEMORY code path, make<Foo>(...) now has significantly larger overhead due to the pthread_getspecific call (via llvm/Support/ThreadLocal.h).
A large number of make<...> instantiations in lld do not benefit from per-thread allocation.

Making individual pieces of code make this distinction (i.e., thread-safe vs. not thread-safe) is a bit bug-prone: a piece of code might initially assume it doesn't need to be thread-safe, but later end up on a multithreaded code path. There are a fair number of cases like this in MachO.
That's why we thought it's safer to have one setting: thread-safe for all or not thread-safe for all.

Having said that, I understand your concern wrt performance in ELF. So how about offering these:

  • make<>() : can alias to either of the following depending on a flag
  • makeUnsafe<>(): Always use the global/shared allocator
  • makeThreadSafe<>() Always use thread-local allocators

This way, ELF can retain its current behaviour, and MachO can choose to be thread-safe everywhere if it wants.
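
A rough sketch of what that three-way split could look like (simplified; globalAlloc/perThreadAlloc are hypothetical stand-ins for the patch's allocator lookup, and the real patch also registers per-thread contexts for later destruction):

#include "llvm/Support/Allocator.h"
#include <new>
#include <utility>

template <typename T> llvm::SpecificBumpPtrAllocator<T> &globalAlloc() {
  static llvm::SpecificBumpPtrAllocator<T> a; // shared, not thread-safe
  return a;
}
template <typename T> llvm::SpecificBumpPtrAllocator<T> &perThreadAlloc() {
  static thread_local llvm::SpecificBumpPtrAllocator<T> a; // one per thread
  return a;
}

template <typename T, typename... Args> T *makeUnsafe(Args &&...args) {
  return new (globalAlloc<T>().Allocate()) T(std::forward<Args>(args)...);
}
template <typename T, typename... Args> T *makeThreadSafe(Args &&...args) {
  return new (perThreadAlloc<T>().Allocate()) T(std::forward<Args>(args)...);
}
template <typename T, typename... Args> T *make(Args &&...args) {
#if LLD_THREAD_SAFE_MEMORY
  return makeThreadSafe<T>(std::forward<Args>(args)...);
#else
  return makeUnsafe<T>(std::forward<Args>(args)...);
#endif
}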

Re: native TLS vs LLVM's ThreadLocal

As noted in the inline comment, I agree that it seems like it could be slower, but benchmarks didn't show any difference. I don't have a strong preference either way (except that LLVM's ThreadLocal simplifies the code a bit...).
So if anyone absolutely has a strong objection to either approach, please raise your concern now. If not, I will go with the more popular recommendation on this patch.

MaskRay added inline comments.Apr 12 2022, 10:35 AM
lld/Common/Memory.cpp
21

Well, I observed a close to 1% regression when linking chromium :(

oontvoo updated this revision to Diff 422304.Apr 12 2022, 11:40 AM

put back different make/saver options:

  • make<>() behaviour remained unchanged (since the macro that controls it is OFF by default.)
  • added back the other two explicit makeUnsafe and makePerThread

Can someone summarize the high level roadmap of this feature? If threadsafe allocation is going to be optional/opt-in, lld can't generally rely on that, right? Or is it meant as a gradual way of merging the code and trying it out, and after it's ready to be enabled unconditionally, it can be relied upon in lld in general?

(So far I would expect that lld doesn't do any allocations from threaded code? And for the recent/upcoming refactorings to make lld usable as a library, the per-session context should hold the allocator, right?)

Secondly, about the choice of mechanism for the thread local data, I'm not familiar with the implications of llvm::sys::ThreadLocal, but for plain C++ thread_local, it's generally preferable if the thread local variable is made e.g. static within one single source file (with an accessor), instead of being declared in a header, accessed from any translation unit. If being accessed directly from multiple translation units, the use of the thread local variable incurs some amount of extra tls wrapper functions and weak symbols, which are occasionally broken in e.g. GCC on Windows. (See e.g. D111779, where a TLS variable in LLDB was moved this way, to fix GCC builds of it.)
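
For what it's worth, the pattern being recommended here (TLS variable confined to a single translation unit, exposed only through an accessor) looks roughly like this generic sketch (not the patch itself):

// Memory.cpp -- the only translation unit that names the thread_local directly.
struct AllocContext { /* BumpPtrAllocator, StringSaver, ... */ };

namespace {
thread_local AllocContext *threadContext = nullptr; // internal linkage
} // namespace

// The header declares only this accessor (and AllocContext); other TUs never
// touch the TLS variable, so no TLS wrapper functions or weak symbols leak
// across translation units.
AllocContext &currentContext() {
  if (!threadContext)
    threadContext = new AllocContext; // real code would also register it
  return *threadContext;
}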

lattner removed a subscriber: lattner.Apr 12 2022, 12:29 PM
oontvoo updated this revision to Diff 422315.Apr 12 2022, 12:44 PM

reverted threadTag back to native TLS

In D122922#3446376, @mstorsjo wrote:

Can someone summarize the high level roadmap of this feature? If threadsafe allocation is going to be optional/opt-in, lld can't generally rely on that, right? Or is it meant as a gradual way of merging the code and trying it out, and after it's ready to be enabled unconditionally, it can be relied upon in lld in general?

Yes, as it stands, the thread safety is controlled by the new LLD_THREAD_SAFE_MEMORY. The intention is for each LLD port to decide whether it wants the thread-safe allocator/saver (and not for *users* of lld).
For MachO, I think we'd like to enable it (we've already been hit by a few race-condition bugs). Can't speak for ELF/others ...

(So far I would expect that lld doesn't do any allocations from threaded code?

why not? can you expand on this?

And for the recent/upcoming refactorings to make lld usable as a library, the per-session context should hold the allocator, right?)

not super familiar with this effort. What is the scope of this "per-session"? Is it per-process? If so, it doesn't solve the problem (here we have multiple threads using the same bptrAlloc, hence the race condition)

Secondly, about the choice of mechanism for the thread local data, I'm not familiar with the implications of llvm::sys::ThreadLocal, but for plain C++ thread_local, it's generally preferable if the thread local variable is made e.g. static within one single source file (with an accessor), instead of being declared in a header, accessed from any translation unit. If being accessed directly from multiple translation units, the use of the thread local variable incurs some amount of extra tls wrapper functions and weak symbols, which are occasionally broken in e.g. GCC on Windows. (See e.g. D111779, where a TLS variable in LLDB was moved this way, to fix GCC builds of it.)

Right, the native TLS variable would be static (not in a header). Pls See updated diff.

In D122922#3446376, @mstorsjo wrote:

Can someone summarize the high level roadmap of this feature? If threadsafe allocation is going to be optional/opt-in, lld can't generally rely on that, right? Or is it meant as a gradual way of merging the code and trying it out, and after it's ready to be enabled unconditionally, it can be relied upon in lld in general?

Yes, as it stands, the thread safety is controlled by the new LLD_THREAD_SAFE_MEMORY. The intention is for each LLD port to decide whether it wants the thread-safe allocator/saver (and not for *users* of lld).
For MachO, I think we'd like to enable it (we've already been hit by a few race-condition bugs). Can't speak for ELF/others ...

Right, but if the MachO part of lld does allocations from multiple threads, then it's essentially unreliable unless this option is enabled? So anyone building lld wanting to use the MachO part of it would need to enable it.

(So far I would expect that lld doesn't do any allocations from threaded code?

why not? can you expand on this?

I'm mostly familiar with the COFF parts of lld, and there, most of the linker logic is serial, and parallelism is only used for short blocks of parallelForEach or parallelSort, where no allocations are assumed to be done.

And for the recent/upcoming refactorings to make lld usable as a library, the per-session context should hold the allocator, right?)

not super familiar with this effort. What is the scope of this "per-session"? Is it per-process? If so, it doesn't solve the problem (here we have multiple threads using the same bptrAlloc, hence the race condition)

No, it's meant so that you can use lld as a library, so that one process can have multiple threads running in parallel, where more than one such thread can call e.g. lld::coff::link() at the same time. I haven't followed the state of the work on that in very close detail, but as far as I understand, the implementation strategy has been to gather everything that previously was global, into a per-invocation context, that is passed around (and/or stored as a thread local variable somewhere I think). So within each linker invocation, you'd only ever have allocations happening on one thread, the one where the user called lld::*::link() - each such linker job then would run multiple threads for occasional parallelism in the linking process though.

Does the MachO linker run longer lived threads within one linker invocation, where any of them can do allocations?

I think it'd be valuable to align these two thread safety efforts so we don't end up with two mechanisms for doing the same. I think @aganea has been involved in that effort though. @aganea - can you chime in here?

Secondly, about the choice of mechanism for the thread local data, I'm not familiar with the implications of llvm::sys::ThreadLocal, but for plain C++ thread_local, it's generally preferable if the thread local variable is made e.g. static within one single source file (with an accessor), instead of being declared in a header, accessed from any translation unit. If being accessed directly from multiple translation units, the use of the thread local variable incurs some amount of extra tls wrapper functions and weak symbols, which are occasionally broken in e.g. GCC on Windows. (See e.g. D111779, where a TLS variable in LLDB was moved this way, to fix GCC builds of it.)

Right, the native TLS variable would be static (not in a header). Pls See updated diff.

Thanks, so LLVM_THREAD_LOCAL AllocContext *threadContext = nullptr; should indeed work just fine.

not super familiar with this effort. What is the scope of this "per-session" ? Is is per-process? If so, it doesn't solve the problem (here we have multiple threads using the same bptrAlloc, hence the race condition)

No, it's meant so that you can use lld as a library, so that one process can have multiple threads running in parallel, where more than one such thread can call e.g. lld::coff::link() at the same time. I haven't followed the state of the work on that in very close detail, but as far as I understand, the implementation strategy has been to gather everything that previously was global, into a per-invocation context, that is passed around (and/or stored as a thread local variable somewhere I think). So within each linker invocation, you'd only ever have allocations happening on one thread, the one where the user called lld::*::link() - each such linker job then would run multiple threads for occasional parallelism in the linking process though.
[...]
I think it'd be valuable to align these two thread safety efforts so we don't end up with two mechanisms for doing the same. I think @aganea has been involved in that effort though. @aganea - can you chime in here?

There are two goals:

  1. The first is the "library-ification", to allow executing any number of sequential "link sessions" from a third-party app (or even from within lld.exe). See discussions in D108850, D110450 and D119049. We need all state related to a "link session" to be contained in a "context", the CommonLinkerContext. Each LLD driver would then implement its own derived class; see COFFLinkerContext for example. This means no state can be global, including thread_local (but llvm::sys::ThreadLocal is fine).
  2. The second is related to multithreaded cooperation, when several "link sessions" will be running in parallel in a single process. This is mainly for D86351, but it could be useful for an LLD-as-a-daemon server. We can discuss all this later, but I think the containing application should host a single ThreadPool and pass it as a reference to lld::safeLldMain() (rough sketch below). Tasks from any "link session" would then be queued on the same global ThreadPool.
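
Very roughly, the shape described in point 2 might look like the following (hypothetical: the exact safeLldMain() signature is still being worked out in the patches above, so the call is shown only as a comment):

#include "llvm/Support/ThreadPool.h"
#include "llvm/Support/Threading.h"
#include <vector>

// Hypothetical host application driving several link sessions concurrently.
void runLinks(const std::vector<std::vector<const char *>> &jobs) {
  llvm::ThreadPool pool(llvm::hardware_concurrency()); // one shared pool
  for (const auto &args : jobs) {
    pool.async([&args] {
      // Each session would queue its parallel work on the shared pool rather
      // than spawning its own threads, e.g. something along the lines of:
      //   lld::safeLldMain(args, llvm::outs(), llvm::errs(), /*pool=*/pool);
      (void)args;
    });
  }
  pool.wait();
}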

Secondly, about the choice of mechanism for the thread local data, I'm not familiar with the implications of llvm::sys::ThreadLocal, but for plain C++ thread_local, it's generally preferable if the thread local variable is made e.g. static within one single source file (with an accessor), instead of being declared in a header, accessed from any translation unit. If being accessed directly from multiple translation units, the use of the thread local variable incurs some amount of extra tls wrapper functions and weak symbols, which are occasionally broken in e.g. GCC on Windows. (See e.g. D111779, where a TLS variable in LLDB was moved this way, to fix GCC builds of it.)

Right, the native TLS variable would be static (not in a header). Pls See updated diff.

Thanks, so LLVM_THREAD_LOCAL AllocContext *threadContext = nullptr; should indeed work just fine.

This will not satisfy point 2 above. It would prevent concurrent CommonLinkerContexts, since LLVM_THREAD_LOCAL acts like a per-thread static: calls to make<>() would allocate from the same thread-local BumpPtrAllocator regardless of the CommonLinkerContext. llvm::sys::ThreadLocal solves that by allocating & freeing a separate TLS slot for each CommonLinkerContext. The code in ~CommonLinkerContext() will otherwise not work. We can fix this later if you wish, but it all defeats the purpose of CommonLinkerContext.

The 1% divergence that @MaskRay mentioned could be fixed if we passed AllocContext where necessary?

In D122922#3446376, @mstorsjo wrote:
Secondly, about the choice of mechanism for the thread local data, I'm not familiar with the implications of llvm::sys::ThreadLocal, but for plain C++ thread_local, it's generally preferable if the thread local variable is made e.g. static within one single source file (with an accessor), instead of being declared in a header, accessed from any translation unit. If being accessed directly from multiple translation units, the use of the thread local variable incurs some amount of extra tls wrapper functions and weak symbols, which are occasionally broken in e.g. GCC on Windows. (See e.g. D111779, where a TLS variable in LLDB was moved this way, to fix GCC builds of it.)

Right, the native TLS variable would be static (not in a header). Pls See updated diff.

Worth noting that ThreadLocal holds a dynamically allocated index into a runtime TLS table, whereas thread_local effectively adds space to the PE's static TLS table. llvm::sys::ThreadLocal does not suffer from the issue mentioned by @mstorsjo.

lld/include/lld/Common/CommonLinkerContext.h
56

Change to globalContext, as opposed to perThreadContext?

I've posted an alternate implementation, please see D123879:

The main differences are:

  • LLD_THREAD_SAFE_MEMORY was removed
  • AllocContext is always thread-local.
  • Using llvm::sys::ThreadLocal to make TLS allocation dynamic at runtime. This is to accommodate for several instances of CommonLinkerContext running concurrently.
  • No "safe" or "perThread" functions, the APIs remain the same as before.

I did not see any divergence in performance (on Windows) when using a two-stage LLD, built with -DLLVM_INTEGRATED_CRT_ALLOC=rpmalloc, with ThinLTO & -march=native.

int3 added a subscriber: thakis.Apr 19 2022, 12:51 PM

I think it's a bit unfortunate that the library-fication work here is making this change hard to land. @thakis and I brought up this exact concern in the original thread (https://discourse.llvm.org/t/rfc-revisiting-lld-as-a-library-design/58445/)...

It doesn't help that we don't have a good set of benchmarks across all 3 platforms & a CI system to run them. Differences in ad-hoc local measurements are hard to resolve.

I wonder if we could move forward with the LLVM_THREAD_LOCAL approach for now in order to unblock further work on parallelizing LLD. I'm sure we can put together enough parallelization wins in LLD-MachO to dwarf a 1% overhead, but that will probably take a few months. I don't know as much about the other LLD ports but I assume there are similar opportunities. It will be easier to eat a regression later after it has "paid for itself" with wins.

Also, if we really wanted to lock in the win from using thread-locals, I reckon we could use some clever macros to switch between LLVM_THREAD_LOCAL and ThreadLocal at compile time, depending on whether we are building a standalone or a library.
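
One way that compile-time switch might look (just a sketch of the idea; LLD_STANDALONE_BUILD is a made-up macro, and in the library build the ThreadLocal would really live inside CommonLinkerContext as discussed above):

#include "llvm/Support/Compiler.h"    // LLVM_THREAD_LOCAL
#include "llvm/Support/ThreadLocal.h" // llvm::sys::ThreadLocal

struct AllocContext { /* allocators, string saver, ... */ };

#if LLD_STANDALONE_BUILD
// Standalone lld: one link per process, so a plain TLS pointer is enough and
// avoids the pthread_getspecific overhead.
static LLVM_THREAD_LOCAL AllocContext *threadContext = nullptr;
static AllocContext *getThreadContext() { return threadContext; }
static void setThreadContext(AllocContext *c) { threadContext = c; }
#else
// Library build: a TLS slot allocated at runtime, so concurrent link sessions
// don't end up sharing one per-thread allocator.
static llvm::sys::ThreadLocal<AllocContext> threadContext;
static AllocContext *getThreadContext() { return threadContext.get(); }
static void setThreadContext(AllocContext *c) { threadContext.set(c); }
#endif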

In summary, my asks are: 1. can we ship the THREAD_LOCAL version for now and fix it later? 2. In the meantime, could we start some discussions on a set of benchmarks to make future modifications of cross-platform LLD code easier to analyze?

aganea added a subscriber: sbc100.Apr 22 2022, 7:48 AM

@int3 I don't want to block this, I understand your needs, and I value the intention of this patch. I'd be happy to reintroduce llvm::sys::ThreadLocal later. However I wasn't able to reproduce the 1% regression (see D123879 for numbers), although I don't deny it could exist in some configurations (depending on the machine, OS, LLVM build options, allocator). I'm just wondering if 1% regression isn't acceptable for now in the trunk, so that we can commit a more frictionless version of this patch.

In D122922#3446376, @mstorsjo wrote:

Can someone summarize the high level roadmap of this feature? If threadsafe allocation is going to be optional/opt-in, lld can't generally rely on that, right? Or is it meant as a gradual way of merging the code and trying it out, and after it's ready to be enabled unconditionally, it can be relied upon in lld in general?

Yes, as it stands, the thread safety is controlled by the new LLD_THREAD_SAFE_MEMORY. The intention is for each LLD port to decide whether it wants the thread-safe allocator/saver (and not for *users* of lld).
For MachO, I think we'd like to enable it (we've already been hit by a few race-condition bugs). Can't speak for ELF/others ...

Right, but if the MachO part of lld does allocations from multiple threads, then it's essentially unreliable unless this option is enabled? So anyone building lld wanting to use the MachO part of it would need to enable it.

This question was left unanswered -- how do you plan on using LLD_THREAD_SAFE_MEMORY? Will it be enabled only when building for Darwin? What happens for cross-compilation or if LLD_THREAD_SAFE_MEMORY isn't enabled? The make/makePerThread/makeUnsafe APIs seem a bit error-prone to me. Ideally it is a choice that should be avoided, if possible, by the business logic/driver developer.

It'd be nice if other driver owners would pitch in to accept/veto this patch. @sbc100 @mstorsjo @MaskRay

Right, but if the MachO part of lld does allocations from multiple threads, then it's essentially unreliable unless this option is enabled? So anyone building lld wanting to use the MachO part of it would need to enable it.

This question was left unanswered -- how do you plan on using LLD_THREAD_SAFE_MEMORY? Will it be enabled only when building for Darwin? What happens for cross-compilation or if LLD_THREAD_SAFE_MEMORY isn't enabled? The make/makePerThread/makeUnsafe APIs seem a bit error-prone to me. Ideally it is a choice that should be avoided, if possible, by the business logic/driver developer.

Sorry, forgot to follow up on this. From an offline chat with int3, I think we decided to remove this and just use the existing LLD_ENABLE_THREADS instead. That is to say, code would get the thread-safe saver()/make() by default, unless it goes out of its way to call the unsafe variants.

It also wasn't clear from @MaskRay's comment whether the 1% slowdown was for llvm::ThreadLocal alone or if it also applied to the native TLS approach. I was not able to reproduce the regression on my end with either approach, and I didn't want to keep updating the patch.

MaskRay added a comment.EditedApr 22 2022, 7:29 PM

Right, but if the MachO part of lld does allocations from multiple threads, then it's essentially unreliable unless this option is enabled? So anyone building lld wanting to use the MachO part of it would need to enable it.

This question was left unanswered -- how do you plan on using LLD_THREAD_SAFE_MEMORY? Will it be enabled only when building for Darwin? What happens for cross-compilation or if LLD_THREAD_SAFE_MEMORY isn't enabled? The make/makePerThread/makeUnsafe APIs seem a bit error-prone to me. Ideally it is a choice that should be avoided, if possible, by the business logic/driver developer.

Sorry, forgot to follow up on this. From an offline chat with int3, I think we decided to remove this and just use the existing LLD_ENABLE_THREADS instead. That is to say, code would get the thread-safe saver()/make() by default, unless it goes out of its way to call the unsafe variants.

It also wasn't clear from @MaskRay's comment whether the 1% slowdown was for llvm::ThreadLocal alone or if it also applied to the native TLS approach. I was not able to reproduce the regression on my end with either approach, and I didn't want to keep updating the patch.

I have a chrome build on Linux x86-64. Say /tmp/c/0 has a lld built from main branch. /tmp/c/1 is a lld built with this patch.
I have measured this:

% hyperfine --warmup 2 --min-runs 16 "numactl -C 20-27 "{/tmp/c/0,/tmp/c/1}" -flavor gnu @response.txt --threads=8"                                           
Benchmark 1: numactl -C 20-27 /tmp/c/0 -flavor gnu @response.txt --threads=8
  Time (mean ± σ):      5.584 s ±  0.035 s    [User: 9.164 s, System: 2.318 s]                                                                                 
  Range (min … max):    5.536 s …  5.656 s    16 runs                          
                                       
Benchmark 2: numactl -C 20-27 /tmp/c/1 -flavor gnu @response.txt --threads=8                                                                                   
  Time (mean ± σ):      5.637 s ±  0.048 s    [User: 9.205 s, System: 2.305 s]
  Range (min … max):    5.565 s …  5.765 s    16 runs
  
Summary
  'numactl -C 20-27 /tmp/c/0 -flavor gnu @response.txt --threads=8' ran
    1.01 ± 0.01 times faster than 'numactl -C 20-27 /tmp/c/1 -flavor gnu @response.txt --threads=8'
% hyperfine --warmup 2 --min-runs 16 "numactl -C 20-23 "{/tmp/c/0,/tmp/c/1}" -flavor gnu @response.txt --threads=4"                                           
Benchmark 1: numactl -C 20-23 /tmp/c/0 -flavor gnu @response.txt --threads=4
  Time (mean ± σ):      6.227 s ±  0.044 s    [User: 8.547 s, System: 2.122 s]
  Range (min … max):    6.165 s …  6.353 s    16 runs
 
Benchmark 2: numactl -C 20-23 /tmp/c/1 -flavor gnu @response.txt --threads=4
  Time (mean ± σ):      6.260 s ±  0.033 s    [User: 8.605 s, System: 2.096 s]
  Range (min … max):    6.200 s …  6.325 s    16 runs
 
Summary
  'numactl -C 20-23 /tmp/c/0 -flavor gnu @response.txt --threads=4' ran
    1.01 ± 0.01 times faster than 'numactl -C 20-23 /tmp/c/1 -flavor gnu @response.txt --threads=4'

Note that I use mimalloc and tend to care about performance more with a better malloc (mimalloc/tcmalloc/jemalloc/snmalloc/etc) than the glibc malloc.

I think it's a bit unfortunate that the library-fication work here is making this change hard to land

I share the same concern. I know that I replied with a +1 on that thread, and at that time I did not pay much attention to the performance.
Now I am probably more on the fence (more so because there are a lot of global variables which I am not sure can be cleaned up in a way that doesn't hurt performance or code readability too much).
I think the original CommonLinkerContext.h change probably caused a 1+% regression. Adding another 1% would be too much.
I am pretty sure llvm/Support/ThreadLocal.h will not be an acceptable solution for ELF due to the pthread_getspecific cost.
thread_local should be fine and, as mentioned previously, has been used in several places in LLVM and Clang.

MaskRay requested changes to this revision.Apr 25 2022, 2:57 PM

Request changes as I think we cannot land this as is.

This revision now requires changes to proceed.Apr 25 2022, 2:57 PM
oontvoo added a comment.EditedApr 26 2022, 10:13 AM

I have a chrome build on Linux x86-64. Say /tmp/c/0 has a lld built from main branch. /tmp/c/1 is a lld built with this patch.

Hi @MaskRay, is this the same chromium repro package we've been using in lld-macho? If not, are you able to share the repro with us?
Thanks!

MaskRay added a comment.EditedApr 26 2022, 10:37 PM

I use chrome as an ELF linker benchmark. After linking chrome, delete it, run ninja -v chrome to get the linker command line, and invoke the linker command line with -Wl,--reproduce=/tmp/chrome.tar to get a reproduce file.

/tmp/c/0: old lld
/tmp/c/1: new lld

I have run the following commands several times and noticed that this version is still a bit slower (the σ in the output is fairly large, but I have invoked the command many times and the new lld consistently appears to be a bit slower):

% hyperfine --warmup 2 --min-runs 16 "numactl -C 20-27 "{/tmp/c/0,/tmp/c/1}" -flavor gnu @response.txt --threads=8"
Benchmark 1: numactl -C 20-27 /tmp/c/0 -flavor gnu @response.txt --threads=8
  Time (mean ± σ):      6.163 s ±  0.107 s    [User: 9.653 s, System: 2.674 s]
  Range (min … max):    6.047 s …  6.395 s    16 runs
 
Benchmark 2: numactl -C 20-27 /tmp/c/1 -flavor gnu @response.txt --threads=8
  Time (mean ± σ):      6.213 s ±  0.122 s    [User: 9.619 s, System: 2.662 s]
  Range (min … max):    6.071 s …  6.474 s    16 runs
 
Summary
  'numactl -C 20-27 /tmp/c/0 -flavor gnu @response.txt --threads=8' ran
    1.01 ± 0.03 times faster than 'numactl -C 20-27 /tmp/c/1 -flavor gnu @response.txt --threads=8'