This is an archive of the discontinued LLVM Phabricator instance.

[lld][common][lld-macho][lld-coff] Support per-thread allocators and StringSavers
Needs Revision · Public

Authored by oontvoo on Apr 1 2022, 10:52 AM.

Details

Reviewers
int3
MaskRay
mstorsjo
rnk
aganea
jyknight
Group Reviewers
Restricted Project
Summary

Details:

Provide thread-safe StringSaver and bAlloc by assigning each thread its own StringSaver and allocator.
Usage: code that might be run in different threads should call the new perThreadSaver() or makePerThread()
instead of the current utilities (saver() and make()).

Users can enable this unconditionally at compile time by setting -DTHREAD_SAFE_MEMORY

Some considerations:

  • This might have some perf/memory impact.
  • TLS support might not be available on all archs. (Darwin?)

Benchmarks linking Envoy and Chromium for Darwin showed no perf difference when using the new allocators

Event Timeline

oontvoo created this revision.Apr 1 2022, 10:52 AM
Herald added a project: Restricted Project.Apr 1 2022, 10:52 AM
Herald added a subscriber: StephenFan.
oontvoo requested review of this revision.Apr 1 2022, 10:52 AM
int3 added inline comments.Apr 1 2022, 3:32 PM
lld/Common/Memory.cpp
41

I was hoping we could avoid mutexes altogether. I was thinking of something like

// threadTag is assumed to be a thread_local size_t, zero-initialized; tag 0 is
// reserved to mean "no context assigned yet".
std::atomic<size_t> contextCount;
std::array<PerThreadContext *, MAX_THREADS> perThreadContexts;

if (threadTag == 0) {
  threadTag = ++contextCount;
  perThreadContexts[threadTag] = new PerThreadContext;
}
MaskRay added inline comments.Apr 1 2022, 9:51 PM
lld/include/lld/Common/CommonLinkerContext.h
24

Delete

unordered_map is very inefficient.

lld/include/lld/Common/Memory.h
82

PerThread may be better than ThreadSafe.

oontvoo updated this revision to Diff 420179.Apr 4 2022, 7:28 AM
oontvoo marked 2 inline comments as done.

removed unused includes and renamed the make() func

oontvoo added inline comments.Apr 4 2022, 10:29 AM
lld/Common/Memory.cpp
41

What is the value of MAX_THREADS? Doesn't that mean "max number of threads that can run concurrently" and not "max number of threads ever created during the application runtime"?

As new threads are spawned, the arrays can expand, and that'd be a potential race condition, yes?

lld/include/lld/Common/CommonLinkerContext.h
24

(removed - it was never used...)

int3 added inline comments.Apr 4 2022, 11:13 AM
lld/Common/Memory.cpp
41

No, I meant "max number of threads ever created" :)

Since we use threadpools, we shouldn't be creating that many threads over the application lifetime anyway.

But you got me thinking whether there's a nicer way to implement this that doesn't involve the user having to calculate that number up front. We could have a thread-local pointer to the per-thread context, so that checking if the context has been created is a simple null check (as opposed to the current hashmap lookup). We only take a lock the first time a thread executes and needs to create a new PerThreadContext, in order to safely add it to a global vector.

To make it 100% lock-free, we could implement our own append-only dynamically expanding series of arrays, much like how the BumpPtrAllocator uses its Slabs. But that's probably overkill :)
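
A minimal sketch of that shape (the getThreadContext() accessor and the registration details are illustrative, not lifted from the patch):

#include <mutex>
#include <vector>

struct PerThreadContext {}; // stand-in for the per-thread allocator/StringSaver state

static std::mutex contextMutex;
static std::vector<PerThreadContext *> allContexts; // guarded by contextMutex
static thread_local PerThreadContext *threadContext = nullptr;

static PerThreadContext &getThreadContext() {
  if (!threadContext) {                   // fast path: a plain null check
    threadContext = new PerThreadContext; // first use on this thread
    std::lock_guard<std::mutex> lock(contextMutex);
    allContexts.push_back(threadContext); // registered once for later cleanup
  }
  return *threadContext;
}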

oontvoo added inline comments.Apr 4 2022, 1:40 PM
lld/Common/Memory.cpp
41

ok, gotcha! benchmarking showed no difference between the previous and the last approach (https://reviews.llvm.org/D123075)

Which one do you prefer?
I guess I liked this patch (updated diff) a bit better

oontvoo updated this revision to Diff 420617.Apr 5 2022, 1:21 PM

use thread_local, which is more portable than __thread

oontvoo edited the summary of this revision. (Show Details)Apr 5 2022, 1:22 PM
oontvoo updated this revision to Diff 420881.Apr 6 2022, 8:25 AM

fixed windows failures

oontvoo edited the summary of this revision. (Show Details)Apr 6 2022, 8:26 AM
int3 added inline comments.Apr 6 2022, 12:24 PM
lld/Common/Memory.cpp
41

constexpr uint32_t MAX_THREADS = std::numeric_limits<int32_t>::max() - 1;

doesn't that mean our std::array is now taking up like 32 MB 🤔 I was thinking of using a much lower number...

but IMO the vector solution I suggested above is cleaner. we would need a lock when pushing onto the vector, but we'll only need to do it once per thread, and we wouldn't need to pre-allocate a whole bunch of memory

benchmarking showed no difference between the previous and the last approach

good to know! do we have any regression vs the non-thread-safe version?

You cannot query the maximum parallelism on demand and then create the array?

int3 added a comment.Apr 6 2022, 12:32 PM

You cannot query the maximum parallelism on demand and then create the array?

It's a little fragile -- things would break if someone created more than one ThreadPool instance & gave each instance the max hardware parallelism

You cannot query the maximum parallelism on demand and then create the array?

It's a little fragile -- things would break if someone created more than one ThreadPool instance & gave each instance the max hardware parallelism

But then MAX_THREADS has the same problem? For two thread pools I would need 2 * MAX_THREADS ?

int3 added a comment.Apr 6 2022, 12:36 PM

right that's why I'm suggesting the use of a vector

oontvoo added inline comments.Apr 6 2022, 12:37 PM
lld/Common/Memory.cpp
41

but IMO the vector solution I suggested above is cleaner. we would need a lock when pushing onto the vector, but we'll only need to do it once per thread, and we wouldn't need to pre-allocate a whole bunch of memory

fair enough - updated the diff to that approach.

do we have any regression vs the non-thread-safe version?

also no difference (tested by unconditionally defining THREAD_SAFE_MEMORY to 1)

Question: Do we want to just enable it now? (Users who don't want it can turn it off or call the old functions, now renamed with an "unsafe" suffix.)

oontvoo updated this revision to Diff 420975.Apr 6 2022, 12:37 PM

updated diff

oontvoo updated this revision to Diff 420978.Apr 6 2022, 12:43 PM

updated diff

int3 added inline comments.Apr 6 2022, 12:52 PM
lld/Common/Memory.cpp
41

also no difference (tested by unconditionally defining THREAD_SAFE_MEMORY to 1)

nice!

Do we want to just enable it now? (Users who don't want it can turn it off or call the old functions, now renamed with an "unsafe" suffix.)

sgtm. I would actually argue that we don't even need to keep the "unsafe" versions around. Folks can add it back if they ever have a use case for it (I can't imagine one atm.) But @MaskRay might want to chime in. Also, can we add someone who works on the COFF backend of LLD as a reviewer?

lld/include/lld/Common/CommonLinkerContext.h
26–27

can rm

34–35

can rm

41–42

nit: IMO it's not necessary to have perThread in these names, since they're already part of a class called PerThreadContext :)

68–69

not sure what this comment is about. what's the concern here?

oontvoo updated this revision to Diff 420991.Apr 6 2022, 1:11 PM
oontvoo marked 4 inline comments as done.

addressed review comments

lld/include/lld/Common/CommonLinkerContext.h
68–69

that was from the previous impl (with the mapping). removed now

oontvoo retitled this revision from [lld][common][lld-macho] Support per-thread allocators and StringSavers to [lld][common][lld-macho][lld-coff] Support per-thread allocators and StringSavers.Apr 6 2022, 1:12 PM
oontvoo added a reviewer: Restricted Project.
oontvoo added a subscriber: rnk.
oontvoo added inline comments.
lld/Common/Memory.cpp
41

Do you know who should be added for COFF? (haven't followed closely .. maybe @rnk ?)

tschuett added inline comments.Apr 6 2022, 1:15 PM
lld/Common/Memory.cpp
45

Would an assert on perThreadContexts.size() < FOO be helpful?

oontvoo added inline comments.Apr 6 2022, 1:20 PM
lld/Common/Memory.cpp
45

the .size() means the number of threads ever created by this process. (note: it is NOT the number of threads that run concurrently). As such, I'm not sure we care what the limit is. (other than that if it runs out of memory it'll crash, but then we'd get a stacktrace and that's useful enough)
Can you clarify why we should cap this?

My bad. I thought of it as a last resort before the system crashes.

smeenai added inline comments.
lld/Common/Memory.cpp
41

@aganea and @mstorsjo are also good contacts for the COFF side.

oontvoo edited the summary of this revision. (Show Details)Apr 6 2022, 1:31 PM
oontvoo added reviewers: mstorsjo, rnk, aganea.
aganea added a subscriber: lattner.Apr 7 2022, 6:48 AM

Hello! Thanks for adding me :-) Interesting challenge!
It's a bit sad that we have to do all this high-level/application-level memory management. Most modern allocators already support (lock-free) per-thread memory pools out-of-the-box, along with migration between threads and to the global pool/arena. This patch seems to be needed solely because we use BumpPtrAllocator/SpecificBumpPtrAllocator. I wonder how things would perform with just using malloc() instead + a modern underlying allocator (rpmalloc, mimalloc, ...). Memory locality brought by the bumpalloc is important, but it'd be interesting to compare benchmarks. FWIW there were discussions with @lattner at some point about integrating rpmalloc into the LLVM codebase, but I never got to post a RFC.

lld/Common/Memory.cpp
19

As things stand, you could also do LLVM_THREAD_LOCAL PerThreadContext *threadContext = nullptr;
However since this is really just like a static it goes against the "*LinkerContext" concept, and since you already added some functionality in the context, you could just as well move this in CommonLinkerContext. You could do something like ThreadLocal AllocContext; (see llvm/include/llvm/Support/ThreadLocal.h). That would allocate a dynamic TLS slot on the first use, as opposed to thread_local which points to a pre-allocated slot in the PE static TLS table.

lld/include/lld/Common/CommonLinkerContext.h
36

I would rather name it considering what the struct does, not how it's used? AllocationContext maybe?

48–49

Why not use your new struct here, even when LLD_THREAD_SAFE_MEMORY == 0?

83

I agree that those "unsafe" functions might not be needed. The same applies to the two "perThread*" functions above. Can we just go through saver() and bAlloc()? Do we really want application code to explicitly choose either per-thread allocation pool or the global allocation pool?

96

LLD_THREAD_SAFE_MEMORY perhaps? Is this meant to be configured through cmake?

lld/include/lld/Common/Memory.h
74

Is there a need for "unsafe" or "perthread*"?

Hello! Thanks for adding me :-) Interesting challenge!
It's a bit sad that we have to do all this high-level/application-level memory management. Most modern allocators already support (lock-free) per-thread memory pools out-of-the-box, along with migration between threads and to the global pool/arena. This patch seems to be needed solely because we use BumpPtrAllocator/SpecificBumpPtrAllocator. I wonder how things would perform with just using malloc() instead + a modern underlying allocator (rpmalloc, mimalloc, ...). Memory locality brought by the bumpalloc is important, but it'd be interesting to compare benchmarks. FWIW there were discussions with @lattner at some point about integrating rpmalloc into the LLVM codebase, but I never got to post a RFC.

We did experiment with using a modern/thread-safe allocator here, specifically tcmalloc internally at Google, and found that it was pretty good (i.e., no slower than the bump-ptr alloc). But conceptually the bump alloc could be faster in simple cases, so we thought it wouldn't hurt to use both when possible (i.e., have a thread-safe bump allocator and a fast system malloc).

oontvoo added inline comments.Apr 7 2022, 8:04 AM
lld/Common/Memory.cpp
19

thanks! I've moved the variable to the LinkerContext class but wasn't able to use LLVM's ThreadLocal because I needed to hold on to the variable's address (to reset its value from another thread).

Would this "just work" with ThreadLocal?

i.e.:

ThreadLocal<AllocationContext> currentThreadCtxt;

std::vector<ThreadLocal<AllocationContext> *> allThreadContexts;

// collecting all the thread-local ctxts' addresses
allThreadContexts.push_back(&currentThreadCtxt);

// in destroy(), reset them to a sentinel value (nullptr)
for (auto *localCtxt : allThreadContexts) {
  delete localCtxt->get();
  localCtxt->set(nullptr);
}
lld/include/lld/Common/Memory.h
74

I didn't want to change all the ports to the thread-safe version because I'm not familiar with all of them (and I was only able to benchmark the MachO port).
If you all think it's "safe" to do this, then yeah, it'd simplify this patch a bit! :)

aganea added inline comments.Apr 7 2022, 10:03 AM
lld/Common/CommonLinkerContext.cpp
54

It would be best if we had llvm::sys::ScopedWriter lock(contextMutex); here too.

lld/Common/Memory.cpp
19

It is not possible to take the address of the TLS slot -- one can only .get() or .set() the pointer in the ThreadLocal object. We needed that address to reset to sentinel in the destructor. But if moving ThreadLocal<AllocationContext> into CommonLinkerContext, we don't need to reset to a sentinel value anymore, since the TLS slot in all threads will die with CommonLinkerContext. If the TLS slot is reused later, it'll be reset to 0 by the system. So we could just use std::vector<AllocationContext *> allThreadContexts (plain pointer).

struct CommonLinkerContext {
  ThreadLocal<AllocationContext> threadContext;
  std::vector<AllocationContext *> allThreadContexts;
  llvm::sys::RWMutex contextMutex;
  ...
};

AllocationContext *CommonLinkerContext::allocCtxt() {
  if (!threadContext.get()) {
    // Context didn't exist yet for this thread, so create a new one.
    auto *context = new AllocationContext;
    threadContext.set(context);

    llvm::sys::ScopedWriter lock(contextMutex);
    allThreadContexts.push_back(context);
  }
  return threadContext.get();
}

CommonLinkerContext::~CommonLinkerContext() {
  ...
  llvm::sys::ScopedWriter lock(contextMutex);
  for (AllocationContext *context : allThreadContexts) {
    for (auto &instance : context->instances)
      instance.second->~SpecificAllocBase();
    delete context;
  }
  ...
}
aganea added inline comments.Apr 7 2022, 10:12 AM
lld/Common/Memory.cpp
19

There's still a chance that an AllocationContext is in use when calling ~CommonLinkerContext(), if all threads were not .joined prior. But that should probably be solved at a higher level?

int3 added a comment.Apr 7 2022, 10:55 AM

We did experiment with using a modern/thread-safe allocator here, specifically tcmalloc internally at Google, and found that it was pretty good (i.e., no slower than the bump-ptr alloc). But conceptually the bump alloc could be faster in simple cases, so we thought it wouldn't hurt to use both when possible (i.e., have a thread-safe bump allocator and a fast system malloc).

IIRC you found that tcmalloc-for-everything was at parity with bump ptr alloc + system allocator, but bump ptr allocator + tcmalloc together was still faster than tcmalloc-for-everything, right?

oontvoo updated this revision to Diff 421308.Apr 7 2022, 12:09 PM
oontvoo marked 3 inline comments as done.

Addressed review comments:

  • Use llvm::sys::ThreadLocal instead of native thread_local (potential concern: the pthread_*() functions it uses might be a bit slower?)
  • Rename macro to LLD_THREAD_SAFE_MEMORY
oontvoo added inline comments.Apr 7 2022, 1:02 PM
lld/Common/Memory.cpp
19

There's still a chance that an AllocationContext is in use when calling ~CommonLinkerContext(), if all threads were not .joined prior. But that should probably be solved at a higher level?

Yes, it's already a problem now, with the objects returned by saver() and bAlloc() being used after the context has already been destroyed.
(We can rely on asan/msan for catching this, no?)

lld/include/lld/Common/CommonLinkerContext.h
83

@MaskRay: any concern with removing this option? IIRC, you would like this option for performance reasons.

lattner removed a subscriber: lattner.Apr 7 2022, 1:44 PM

We did experiment with using a modern/thread-safe allocator here, specifically tcmalloc internally at Google, and found that it was pretty good (i.e., no slower than the bump-ptr alloc). But conceptually the bump alloc could be faster in simple cases, so we thought it wouldn't hurt to use both when possible (i.e., have a thread-safe bump allocator and a fast system malloc).

IIRC you found that tcmalloc-for-everything was at parity with bump ptr alloc + system allocator, but bump ptr allocator + tcmalloc together was still faster than tcmalloc-for-everything, right?

Right - that's correct. :) Thanks for clarifying that. (Except with the major caveat that tcmalloc doesn't officially support Darwin.)

The goal of this patch, though, isn't necessarily to speed up the allocators - it's simply to make them thread safe and not slower.

int3 added a comment.Apr 11 2022, 12:24 PM

Looks like the other comments have been addressed. I will stamp it at EOD today if no one objects

lld/include/lld/Common/CommonLinkerContext.h
83

let's remove it for now, folks can always add it back if they want to use it

let's drop the "perThread" prefixes too

oontvoo updated this revision to Diff 422020.Apr 11 2022, 1:13 PM
oontvoo marked an inline comment as done.

removed the unneeded helpers and defined the cmake option for LLD_THREAD_SAFE_MEMORY

int3 accepted this revision.Apr 11 2022, 5:54 PM

lgtm. Let's update the commit message, in particular

Usage: code that might be run in different threads should call the new perThreadSaver() or makePerThread()

instead of the current utilities (saver() and make()).

also

TLS support might not be available on all archs. (Darwin?)

Is this still an ongoing concern? (Why do you think TLS support might not be available on Darwin?)

lld/CMakeLists.txt
228

I'm generally not a fan of adding compile-time options as I think it increases the number of potentially poorly-tested code paths. Maybe we could just use LLD_ENABLE_THREADS to gate the code currently protected by LLVM_THREAD_SAFE_MEMORY.

But I guess having things behind a compile-time flag for now will make this diff easier to land. It's conceivable that some buildbots may be unhappy with this depending on e.g. the target-specific support for TLS. Can we follow up with a diff that makes the thread-safe behavior the default? If that passes all the buildbots, I think we should make it the default behavior.

This revision is now accepted and ready to land.Apr 11 2022, 5:54 PM
MaskRay added inline comments.Apr 11 2022, 8:30 PM
lld/Common/Memory.cpp
19

We can use LLVM_THREAD_LOCAL instead of llvm/Support/ThreadLocal.h. llvm/Support/ThreadLocal.h uses pthread_getspecific which is slower than TLS.

LLVM_THREAD_LOCAL has been used in llvm/lib/Support and clang.

MaskRay added a comment.EditedApr 11 2022, 9:04 PM

In the LLD_THREAD_SAFE_MEMORY code path, make<Foo>(...) now has significantly larger overhead due to the pthread_getspecific call (via llvm/Support/ThreadLocal.h).
A large number of make<...> instantiations in lld do not benefit from per-thread allocation.
So my thought (I mentioned this somewhere) is that we introduce a new make utility, let the few that actually benefit use it, instead of hijacking all make<...> to be pure per-thread.

If we trace the logic in a debugger, there is a surge of abstraction costs. A make call needs:

  • call lld::commonContext() (in CommonLinkerContext.cpp) to get the global variable lctx (there is a TODO that it may become thread_local)
  • in CommonLinkerContext::perThreadAllocContext(), call pthread_getspecific to get the thread-specific data key
  • in lld::SpecificAllocBase::getOrCreatePerThread, threadContext->instances[tag] retrieves the instance from a DenseMap

I'd hope we step back and think whether the overhead can be reduced.

MaskRay added a subscriber: lattner.EditedApr 11 2022, 9:19 PM

Hello! Thanks for adding me :-) Interesting challenge!

It's a bit sad that we have to do all this high-level/application-level memory management. Most modern allocators already support (lock-free) per-thread memory pools out-of-the-box, along with migration between threads and to the global pool/arena. This patch seems to be needed solely because we use BumpPtrAllocator/SpecificBumpPtrAllocator. I wonder how things would perform with just using malloc() instead + a modern underlying allocator (rpmalloc, mimalloc, ...). Memory locality brought by the bumpalloc is important, but it'd be interesting to compare benchmarks. FWIW there were discussions with @lattner at some point about integrating rpmalloc into the LLVM codebase, but I never got to post a RFC.

I think the most important reason that we need llvm::SpecificBumpPtrAllocator<T> is for its destructors.
In lld, we allocate many objects with non-trivial destructors. make<T>(...) allows us to be lazy and not think of the destructor.
I have actually replaced many make<T>(...) singleton usage in lld/ELF with std::make_unique<T>(...) to save code size.

The second benefit is that application-level memory management is more efficient when prudently used.
Whatever lock-free scheme is used, the system malloc will have higher overhead than a bump allocator for the use case where objects are rarely freed individually.
Very few classes in lld actually need this (e.g. Symbol, Section, possibly InputFile), but we abused make<T>(...) for almost everything because of its ease of use.
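
As a rough illustration of that trade-off (Config here is just a stand-in type with a non-trivial destructor, not lld's real configuration object):

#include <memory>
#include <string>

struct Config { std::string outputFile; }; // stand-in singleton

// Bump-allocated: lld::make<Config>() lets us forget about the destructor,
// which runs only when the allocation context is torn down.
//   Config *config = lld::make<Config>();

// Owned singleton: std::make_unique runs the destructor at a well-defined
// point and skips the make<T> machinery entirely.
std::unique_ptr<Config> config = std::make_unique<Config>();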

oontvoo added inline comments.Apr 12 2022, 6:27 AM
lld/Common/Memory.cpp
19

yeah, I've noted that pthread_getspecific could be a bit problematic but in practice I didn't see any performance impact when running the benchmarks.
The advantage of using the ThreadLocal class is that we could put the "tag" inside the context class and that's better encapsulation.
WDYT?

oontvoo added inline comments.Apr 12 2022, 6:35 AM
lld/CMakeLists.txt
228

(sorry - missed this comment previously)

Good idea on using LLD_ENABLE_THREADS! I added this additional variable because I wanted to give other ports an option to opt out of this if they choose to, for whatever reason.
So if no one objects to enabling this unconditionally (or via LLD_ENABLE_THREADS), then I'm happy to remove this.

In the LLD_THREAD_SAFE_MEMORY code path, make<Foo>(...) now has significantly larger overhead due to the pthread_getspecific call (via llvm/Support/ThreadLocal.h).
A large number of make<...> instantiations in lld do not benefit from per-thread allocation.

Making individual pieces of code make this distinction (i.e., thread-safe vs. not thread-safe) is a bit bug-prone: a piece of code may initially think it doesn't need to be thread safe, but later end up on a multi-threaded code path. There are a fair number of cases like this in MachO.
That's why we thought it's safer to have one setting: thread-safe for all or not thread-safe for all.

Having said that, I understand your concern wrt performance in ELF. So how about offering these:

  • make<>(): can alias to either of the following, depending on a flag
  • makeUnsafe<>(): always use the global/shared allocator
  • makeThreadSafe<>(): always use thread-local allocators

This way, ELF can retain its current behaviour, and MachO can choose to be all thread-safe if it wants (rough sketch of the dispatch below).
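
A rough sketch of that dispatch, using the names from the list above (the updated diff calls the per-thread variant makePerThread, and globalAlloc()/perThreadAlloc() are hypothetical accessors standing in for the real allocator plumbing; the real lld::make<T> also registers a per-type SpecificBumpPtrAllocator so destructors run at context teardown, which is elided here):

#include "llvm/Support/Allocator.h"
#include <new>
#include <utility>

llvm::BumpPtrAllocator &globalAlloc();    // hypothetical: process-wide allocator
llvm::BumpPtrAllocator &perThreadAlloc(); // hypothetical: current thread's allocator

template <typename T, typename... Args> T *makeUnsafe(Args &&...args) {
  return new (globalAlloc().Allocate(sizeof(T), alignof(T)))
      T(std::forward<Args>(args)...);
}

template <typename T, typename... Args> T *makeThreadSafe(Args &&...args) {
  return new (perThreadAlloc().Allocate(sizeof(T), alignof(T)))
      T(std::forward<Args>(args)...);
}

template <typename T, typename... Args> T *make(Args &&...args) {
#if LLD_THREAD_SAFE_MEMORY
  return makeThreadSafe<T>(std::forward<Args>(args)...);
#else
  return makeUnsafe<T>(std::forward<Args>(args)...);
#endif
}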

Re: native TLS vs LLVM's ThreadLocal

As noted in the inline comment, I agree that it seems like it could be slower, but benchmarks didn't show any difference. I don't have a strong preference either way (except that LLVM's ThreadLocal simplifies the code a bit...).
So if anyone absolutely has a strong objection to either approach, please raise your concern now. If not, I will go with the more popular recommendation on this patch.

MaskRay added inline comments.Apr 12 2022, 10:35 AM
lld/Common/Memory.cpp
19

Well, I observed a close to 1% regression when linking chromium :(

oontvoo updated this revision to Diff 422304.Apr 12 2022, 11:40 AM

put back different make/saver options:

  • make<>() behaviour remained unchanged (since the macro that controls it is OFF by default.)
  • added back the other two explicit makeUnsafe and makePerThread

Can someone summarize the high level roadmap of this feature? If threadsafe allocation is going to be optional/opt-in, lld can't generally rely on that, right? Or is it meant as a gradual way of merging the code and trying it out, and after it's ready to be enabled unconditionally, it can be relied upon in lld in general?

(So far I would expect that lld doesn't do any allocations from threaded code? And for the recent/upcoming refactorings to make lld usable as a library, the per-session context should hold the allocator, right?)

Secondly, about the choice of mechanism for the thread local data, I'm not familiar with the implications of llvm::sys::ThreadLocal, but for plain C++ thread_local, it's generally preferable if the thread local variable is made e.g. static within one single source file (with an accessor), instead of being declared in a header, accessed from any translation unit. If being accessed directly from multiple translation units, the use of the thread local variable incurs some amount of extra tls wrapper functions and weak symbols, which are occasionally broken in e.g. GCC on Windows. (See e.g. D111779, where a TLS variable in LLDB was moved this way, to fix GCC builds of it.)

lattner removed a subscriber: lattner.Apr 12 2022, 12:29 PM
oontvoo updated this revision to Diff 422315.Apr 12 2022, 12:44 PM

reverted threadtag back to native TLS

In D122922#3446376, @mstorsjo wrote:

Can someone summarize the high level roadmap of this feature? If threadsafe allocation is going to be optional/opt-in, lld can't generally rely on that, right? Or is it meant as a gradual way of merging the code and trying it out, and after it's ready to be enabled unconditionally, it can be relied upon in lld in general?

Yes, as it stands, the thread safety is controlled by the new LLD_THREAD_SAFE_MEMORY. The intention is for each LLD port to decide whether it wants the thread-safe allocator/saver (and not for *users* of lld).
For MachO, I think we'd like to enable it (we've already been hit by a few race-condition bugs). Can't speak for ELF/others...

(So far I would expect that lld doesn't do any allocations from threaded code?

why not? can you expand on this?

And for the recent/upcoming refactorings to make lld usable as a library, the per-session context should hold the allocator, right?)

not super familiar with this effort. What is the scope of this "per-session"? Is it per-process? If so, it doesn't solve the problem (here we have multiple threads using the same bptrAlloc, hence the race condition)

Secondly, about the choice of mechanism for the thread local data, I'm not familiar with the implications of llvm::sys::ThreadLocal, but for plain C++ thread_local, it's generally preferable if the thread local variable is made e.g. static within one single source file (with an accessor), instead of being declared in a header, accessed from any translation unit. If being accessed directly from multiple translation units, the use of the thread local variable incurs some amount of extra tls wrapper functions and weak symbols, which are occasionally broken in e.g. GCC on Windows. (See e.g. D111779, where a TLS variable in LLDB was moved this way, to fix GCC builds of it.)

Right, the native TLS variable would be static (not in a header). Please see the updated diff.

In D122922#3446376, @mstorsjo wrote:

Can someone summarize the high level roadmap of this feature? If threadsafe allocation is going to be optional/opt-in, lld can't generally rely on that, right? Or is it meant as a gradual way of merging the code and trying it out, and after it's ready to be enabled unconditionally, it can be relied upon in lld in general?

Yes, as it stands, the thread safety is controlled by the new LLD_THREAD_SAFE_MEMORY. The intention is for each LLD port to decide whether it wants the thread-safe allocator/saver (and not for *users* of lld).
For MachO, I think we'd like to enable it (we've already been hit by a few race-condition bugs). Can't speak for ELF/others...

Right, but if the MachO part of lld does allocations from multiple threads, then it's essentially unreliable unless this option is enabled? So anyone building lld wanting to use the MachO part of it would need to enable it.

(So far I would expect that lld doesn't do any allocations from threaded code?

why not? can you expand on this?

I'm mostly familiar with the COFF parts of lld, and there, most of the linker logic is serial, and parallelism is only used for short blocks of parallelForEach or parallelSort, where no allocations are assumed to be done.

And for the recent/upcoming refactorings to make lld usable as a library, the per-session context should hold the allocator, right?)

not super familiar with this effort. What is the scope of this "per-session"? Is it per-process? If so, it doesn't solve the problem (here we have multiple threads using the same bptrAlloc, hence the race condition)

No, it's meant so that you can use lld as a library, so that one process can have multiple threads running in parallel, where more than one such thread can call e.g. lld::coff::link() at the same time. I haven't followed the state of the work on that in very close detail, but as far as I understand, the implementation strategy has been to gather everything that previously was global, into a per-invocation context, that is passed around (and/or stored as a thread local variable somewhere I think). So within each linker invocation, you'd only ever have allocations happening on one thread, the one where the user called lld::*::link() - each such linker job then would run multiple threads for occasional parallelism in the linking process though.

Does the MachO linker run longer lived threads within one linker invocation, where any of them can do allocations?

I think it'd be valuable to align these two thread safety efforts so we don't end up with two mechanisms for doing the same. I think @aganea has been involved in that effort though. @aganea - can you chime in here?

Secondly, about the choice of mechanism for the thread local data, I'm not familiar with the implications of llvm::sys::ThreadLocal, but for plain C++ thread_local, it's generally preferable if the thread local variable is made e.g. static within one single source file (with an accessor), instead of being declared in a header, accessed from any translation unit. If being accessed directly from multiple translation units, the use of the thread local variable incurs some amount of extra tls wrapper functions and weak symbols, which are occasionally broken in e.g. GCC on Windows. (See e.g. D111779, where a TLS variable in LLDB was moved this way, to fix GCC builds of it.)

Right, the native TLS variable would be static (not in a header). Please see the updated diff.

Thanks, so LLVM_THREAD_LOCAL AllocContext *threadContext = nullptr; should indeed work just fine.

not super familiar with this effort. What is the scope of this "per-session"? Is it per-process? If so, it doesn't solve the problem (here we have multiple threads using the same bptrAlloc, hence the race condition)

No, it's meant so that you can use lld as a library, so that one process can have multiple threads running in parallel, where more than one such thread can call e.g. lld::coff::link() at the same time. I haven't followed the state of the work on that in very close detail, but as far as I understand, the implementation strategy has been to gather everything that previously was global, into a per-invocation context, that is passed around (and/or stored as a thread local variable somewhere I think). So within each linker invocation, you'd only ever have allocations happening on one thread, the one where the user called lld::*::link() - each such linker job then would run multiple threads for occasional parallelism in the linking process though.
[...]
I think it'd be valuable to align these two thread safety efforts so we don't end up with two mechanisms for doing the same. I think @aganea has been involved in that effort though. @aganea - can you chime in here?

There are two goals:

  1. The first is the "library-ification", to allow executing any number of sequential "link sessions" from a third-party app (or even from within lld.exe). See the discussions in D108850, D110450 and D119049. We need all state related to a "link session" to be contained in a "context", the CommonLinkerContext. Each LLD driver would then implement its own derived class; see COFFLinkerContext for example. This means no state can be global, including thread_local (but llvm::sys::ThreadLocal is fine).
  2. The second is related to multithreaded cooperation, when several "link sessions" will be running in parallel in a single process. This is mainly for D86351, but it could be useful for an LLD-as-a-daemon server. We can discuss all this later, but I think the containing application should host a single ThreadPool and pass it as a reference to lld::safeLldMain(). Tasks from any "link session" would then be queued on the same global ThreadPool.

Secondly, about the choice of mechanism for the thread local data, I'm not familiar with the implications of llvm::sys::ThreadLocal, but for plain C++ thread_local, it's generally preferable if the thread local variable is made e.g. static within one single source file (with an accessor), instead of being declared in a header, accessed from any translation unit. If being accessed directly from multiple translation units, the use of the thread local variable incurs some amount of extra tls wrapper functions and weak symbols, which are occasionally broken in e.g. GCC on Windows. (See e.g. D111779, where a TLS variable in LLDB was moved this way, to fix GCC builds of it.)

Right, the native TLS variable would be static (not in a header). Please see the updated diff.

Thanks, so LLVM_THREAD_LOCAL AllocContext *threadContext = nullptr; should indeed work just fine.

This will not satisfy point 2 above. This would prevent concurrent CommonLinkerContexts, since LLVM_THREAD_LOCAL acts like a per-thread static. Calls to make<>() would allocate on the same thread-local BumpPtrAllocator, regardless of the CommonLinkerContext. llvm::sys::ThreadLocal solves that by allocating & freeing a separate TLS slot for each CommonLinkerContext. The code in ~CommonLinkerContext() will otherwise not work. We can fix this later if you wish, but it all defeats the purpose of CommonLinkerContext.

The 1% divergence that @MaskRay mentioned could be fixed if we passed AllocContext where necessary?

In D122922#3446376, @mstorsjo wrote:
Secondly, about the choice of mechanism for the thread local data, I'm not familiar with the implications of llvm::sys::ThreadLocal, but for plain C++ thread_local, it's generally preferable if the thread local variable is made e.g. static within one single source file (with an accessor), instead of being declared in a header, accessed from any translation unit. If being accessed directly from multiple translation units, the use of the thread local variable incurs some amount of extra tls wrapper functions and weak symbols, which are occasionally broken in e.g. GCC on Windows. (See e.g. D111779, where a TLS variable in LLDB was moved this way, to fix GCC builds of it.)

Right, the native TLS variable would be static (not in a header). Please see the updated diff.

Worth noting that ThreadLocal holds a dynamically allocated index into a runtime TLS table, whereas thread_local effectively adds space to the PE's static TLS table. llvm::sys::ThreadLocal does not suffer from the issue mentioned by @mstorsjo.

lld/include/lld/Common/CommonLinkerContext.h
50

Change to globalContext, as opposed to perThreadContext?

I've posted an alternate implementation, please see D123879:

The main differences are:

  • LLD_THREAD_SAFE_MEMORY was removed
  • AllocContext is always thread-local.
  • Using llvm::sys::ThreadLocal to make TLS allocation dynamic at runtime. This is to accommodate several instances of CommonLinkerContext running concurrently.
  • No "safe" or "perThread" functions, the APIs remain the same as before.

I did not see any divergence in performance (on Windows) when using a two-stage LLD, built with -DLLVM_INTEGRATED_CRT_ALLOC=rpmalloc, with ThinLTO & -march=native.

int3 added a subscriber: thakis.Apr 19 2022, 12:51 PM

I think it's a bit unfortunate that the library-fication work here is making this change hard to land. @thakis and I brought up this exact concern in the original thread (https://discourse.llvm.org/t/rfc-revisiting-lld-as-a-library-design/58445/)...

It doesn't help that we don't have a good set of benchmarks across all 3 platforms & a CI system to run them. Differences in ad-hoc local measurements are hard to resolve.

I wonder if we could move forward with the LLVM_THREAD_LOCAL approach for now in order to unblock further work on parallelizing LLD. I'm sure we can put together enough parallelization wins in LLD-MachO to dwarf a 1% overhead, but that will probably take a few months. I don't know as much about the other LLD ports but I assume there are similar opportunities. It will be easier to eat a regression later after it has "paid for itself" with wins.

Also, if we really wanted to lock in the win from using thread-locals, I reckon we could use some clever macros to switch between LLVM_THREAD_LOCAL and ThreadLocal at compile time, depending on whether we are building a standalone or a library.
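
For illustration, that switch might look roughly like this (LLD_BUILT_AS_LIBRARY is a hypothetical flag; in the library configuration the ThreadLocal would really live inside CommonLinkerContext, as in the sketch earlier in this review, and is file-scope here only for brevity):

#include "llvm/Support/Compiler.h"    // LLVM_THREAD_LOCAL
#include "llvm/Support/ThreadLocal.h" // llvm::sys::ThreadLocal

struct AllocContext {}; // stand-in for the per-thread allocation state

#if LLD_BUILT_AS_LIBRARY
// Library builds: a dynamically allocated TLS slot (pthread_getspecific under the hood).
static llvm::sys::ThreadLocal<AllocContext> tlsContext;
static AllocContext *currentAllocContext() { return tlsContext.get(); }
#else
// Standalone builds: native TLS, cheaper to access.
static LLVM_THREAD_LOCAL AllocContext *tlsContext = nullptr;
static AllocContext *currentAllocContext() { return tlsContext; }
#endif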

In summary, my asks are: 1. can we ship the THREAD_LOCAL version for now and fix it later? 2. In the meantime, could we start some discussions on a set of benchmarks to make future modifications of cross-platform LLD code easier to analyze?

aganea added a subscriber: sbc100.Apr 22 2022, 7:48 AM

@int3 I don't want to block this, I understand your needs, and I value the intention of this patch. I'd be happy to reintroduce llvm::sys::ThreadLocal later. However, I wasn't able to reproduce the 1% regression (see D123879 for numbers), although I don't deny it could exist in some configurations (depending on the machine, OS, LLVM build options, allocator). I'm just wondering whether a 1% regression isn't acceptable for now in trunk, so that we can commit a more frictionless version of this patch.

In D122922#3446376, @mstorsjo wrote:

Can someone summarize the high level roadmap of this feature? If threadsafe allocation is going to be optional/opt-in, lld can't generally rely on that, right? Or is it meant as a gradual way of merging the code and trying it out, and after it's ready to be enabled unconditionally, it can be relied upon in lld in general?

Yes, as it stands, the thread safety is controlled by the new LLD_THREAD_SAFE_MEMORY. The intention is for each LLD port to decide whether it wants the thread-safe allocator/saver (and not for *users* of lld).
For MachO, I think we'd like to enable it (we've already been hit by a few race-condition bugs). Can't speak for ELF/others...

Right, but if the MachO part of lld does allocations from multiple threads, then it's essentially unreliable unless this option is enabled? So anyone building lld wanting to use the MachO part of it would need to enable it.

This question was left unanswered -- how do you plan on using LLD_THREAD_SAFE_MEMORY? Will it be enabled only when building for Darwin? What happens for cross-compilation or if LLD_THREAD_SAFE_MEMORY isn't enabled? The make/makePerThread/makeUnsafe APIs seem a bit error-prone to me. Ideally it is a choice that should be avoided, if possible, by the business logic/driver developer.

It'd be nice if other driver owners would pitch in to accept/veto this patch. @sbc100 @mstorsjo @MaskRay

Right, but if the MachO part of lld does allocations from multiple threads, then it's essentially unreliable unless this option is enabled? So anyone building lld wanting to use the MachO part of it would need to enable it.

This question was left unanswered -- how do you plan on using LLD_THREAD_SAFE_MEMORY? Will it be enabled only when building for Darwin? What happens for cross-compilation or if LLD_THREAD_SAFE_MEMORY isn't enabled? The make/makePerThread/makeUnsafe APIs seem a bit error-prone to me. Ideally it is a choice that should be avoided, if possible, by the business logic/driver developer.

Sorry, forgot to follow up on this. From an offline chat with int3, I think we decided to remove this and just use the existing LLD_ENABLE_THREADS instead. That is to say, code would get a thread-safe saver()/make() by default, unless it goes out of its way to call the unsafe variants.

It also wasn't clear from @MaskRay's comment whether the 1% slowdown was for llvm::ThreadLocal alone or whether it also applied to the native TLS approach. I was not able to reproduce the regression on my end with either approach, and I didn't want to keep updating the patch.

MaskRay added a comment.EditedApr 22 2022, 7:29 PM

Right, but if the MachO part of lld does allocations from multiple threads, then it's essentially unreliable unless this option is enabled? So anyone building lld wanting to use the MachO part of it would need to enable it.

This question was left unanswered -- how do you plan on using LLD_THREAD_SAFE_MEMORY? Will it be enabled only when building for Darwin? What happens for cross-compilation or if LLD_THREAD_SAFE_MEMORY isn't enabled? The make/makePerThread/makeUnsafe APIs seem a bit error-prone to me. Ideally it is a choice that should be avoided, if possible, by the business logic/driver developer.

Sorry, forgot to follow up on this. From an offline chat with int3, I think we decided to remove this and just use the existing LLD_ENABLE_THREADS instead. That is to say, code would get a thread-safe saver()/make() by default, unless it goes out of its way to call the unsafe variants.

It also wasn't clear from @MaskRay's comment whether the 1% slowdown was for llvm::ThreadLocal alone or whether it also applied to the native TLS approach. I was not able to reproduce the regression on my end with either approach, and I didn't want to keep updating the patch.

I have a chrome build on Linux x86-64. Say /tmp/c/0 has a lld built from main branch. /tmp/c/1 is a lld built with this patch.
I have measured this:

% hyperfine --warmup 2 --min-runs 16 "numactl -C 20-27 "{/tmp/c/0,/tmp/c/1}" -flavor gnu @response.txt --threads=8"                                           
Benchmark 1: numactl -C 20-27 /tmp/c/0 -flavor gnu @response.txt --threads=8
  Time (mean ± σ):      5.584 s ±  0.035 s    [User: 9.164 s, System: 2.318 s]                                                                                 
  Range (min … max):    5.536 s …  5.656 s    16 runs                          
                                       
Benchmark 2: numactl -C 20-27 /tmp/c/1 -flavor gnu @response.txt --threads=8                                                                                   
  Time (mean ± σ):      5.637 s ±  0.048 s    [User: 9.205 s, System: 2.305 s]
  Range (min … max):    5.565 s …  5.765 s    16 runs
  
Summary
  'numactl -C 20-27 /tmp/c/0 -flavor gnu @response.txt --threads=8' ran
    1.01 ± 0.01 times faster than 'numactl -C 20-27 /tmp/c/1 -flavor gnu @response.txt --threads=8'
% hyperfine --warmup 2 --min-runs 16 "numactl -C 20-23 "{/tmp/c/0,/tmp/c/1}" -flavor gnu @response.txt --threads=4"                                           
Benchmark 1: numactl -C 20-23 /tmp/c/0 -flavor gnu @response.txt --threads=4
  Time (mean ± σ):      6.227 s ±  0.044 s    [User: 8.547 s, System: 2.122 s]
  Range (min … max):    6.165 s …  6.353 s    16 runs
 
Benchmark 2: numactl -C 20-23 /tmp/c/1 -flavor gnu @response.txt --threads=4
  Time (mean ± σ):      6.260 s ±  0.033 s    [User: 8.605 s, System: 2.096 s]
  Range (min … max):    6.200 s …  6.325 s    16 runs
 
Summary
  'numactl -C 20-23 /tmp/c/0 -flavor gnu @response.txt --threads=4' ran
    1.01 ± 0.01 times faster than 'numactl -C 20-23 /tmp/c/1 -flavor gnu @response.txt --threads=4'

Note that I use mimalloc and tend to care about performance more with a better malloc (mimalloc/tcmalloc/jemalloc/snmalloc/etc) than the glibc malloc.

I think it's a bit unfortunate that the library-fication work here is making this change hard to land

I share the same concern. I know that I replied a +1 on that thread, and at that time I did not pay much attention to the performance.
Now I am probably more on the fence (more so because there are a lot of global variables which I am not sure can be cleaned up in a way that doesn't harm performance or code readability too much).
I think the original CommonLinkerContext.h change probably caused a 1+% regression. Adding another 1% will be too much.
I am pretty sure llvm/Support/ThreadLocal.h will not be an acceptable solution for ELF due to the pthread_getspecific cost.
thread_local should be fine and, as mentioned previously, has been used in several places in llvm and clang.

MaskRay requested changes to this revision.Apr 25 2022, 2:57 PM

Request changes as I think we cannot land this as is.

This revision now requires changes to proceed.Apr 25 2022, 2:57 PM
oontvoo added a comment.EditedApr 26 2022, 10:13 AM

I have a chrome build on Linux x86-64. Say /tmp/c/0 has a lld built from main branch. /tmp/c/1 is a lld built with this patch.

Hi @MaskRay, is this the same chromium repro package we've been using in lld-macho? If not, are you able to share the repro with us?
Thanks!

MaskRay added a comment.EditedApr 26 2022, 10:37 PM

I use chrome as an ELF linker benchmark. After linking chrome, delete it, run ninja -v chrome to get the linker command line, and invoke the linker command line with -Wl,--reproduce=/tmp/chrome.tar to get a reproduce file.

/tmp/c/0: old lld
/tmp/c/1: new lld

I have run the following commands several times and noticed that this version is still a bit slower (the σ in the output is admittedly large, but I have invoked the command many times and the new lld consistently appears to be a bit slower)

% hyperfine --warmup 2 --min-runs 16 "numactl -C 20-27 "{/tmp/c/0,/tmp/c/1}" -flavor gnu @response.txt --threads=8"
Benchmark 1: numactl -C 20-27 /tmp/c/0 -flavor gnu @response.txt --threads=8
  Time (mean ± σ):      6.163 s ±  0.107 s    [User: 9.653 s, System: 2.674 s]
  Range (min … max):    6.047 s …  6.395 s    16 runs
 
Benchmark 2: numactl -C 20-27 /tmp/c/1 -flavor gnu @response.txt --threads=8
  Time (mean ± σ):      6.213 s ±  0.122 s    [User: 9.619 s, System: 2.662 s]
  Range (min … max):    6.071 s …  6.474 s    16 runs
 
Summary
  'numactl -C 20-27 /tmp/c/0 -flavor gnu @response.txt --threads=8' ran
    1.01 ± 0.03 times faster than 'numactl -C 20-27 /tmp/c/1 -flavor gnu @response.txt --threads=8'