This is an archive of the discontinued LLVM Phabricator instance.

[lld][macho][elf] Teach the bump-allocator in lld/Common about thread-safety.
AbandonedPublic

Authored by oontvoo on Mar 14 2022, 1:31 PM.

Details

Reviewers
int3
MaskRay
Group Reviewers
Restricted Project
Summary

The current implementation is quite simple: hold a lock briefly while allocating new memory.
(Conceptually a bump allocator can be made lock-free with atomic increments, but the current implementation looked complex enough that I was discouraged from attempting it.)
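A minimal sketch of the lock-around-the-bump approach described above (hypothetical names and sizes, not the actual patch; assumes each allocation fits in one slab and the alignment is a power of two no larger than the slab's natural alignment):

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical sketch: a bump allocator whose pointer bump is guarded by a
// mutex that is held only for the duration of one allocation.
class LockedBumpAllocator {
  static constexpr size_t SlabSize = 4096;
  std::vector<std::unique_ptr<char[]>> Slabs;
  char *Cur = nullptr;
  char *End = nullptr;
  std::mutex Mu;

  void newSlab() {
    Slabs.push_back(std::make_unique<char[]>(SlabSize));
    Cur = Slabs.back().get();
    End = Cur + SlabSize;
  }

public:
  void *allocate(size_t Size, size_t Align) {
    std::lock_guard<std::mutex> Lock(Mu); // serialize every allocation
    uintptr_t P =
        (reinterpret_cast<uintptr_t>(Cur) + Align - 1) & ~(Align - 1);
    if (!Cur || P + Size > reinterpret_cast<uintptr_t>(End)) {
      newSlab(); // out of room: grab a fresh slab
      P = reinterpret_cast<uintptr_t>(Cur);
    }
    Cur = reinterpret_cast<char *>(P + Size);
    return reinterpret_cast<void *>(P);
  }
};
```

The cost is one lock/unlock per allocation even when there is no contention, which is what the USER CPU numbers below appear to reflect.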
Performance data (profiling chromium-framework):

x ./lld_macho_base
+ ./lld_macho_safe_alloc

SYSTEM CPU time:
    N           Min           Max        Median           Avg        Stddev
x   5          0.73          0.83          0.79          0.79   0.037416574
+   5          0.71          0.88          0.77         0.784   0.062689712
No difference proven at 95.0% confidence

USER CPU time:
    N           Min           Max        Median           Avg        Stddev
x   5          3.62          3.73          3.65          3.66   0.043588989
+   5          3.74          3.86          3.84         3.816   0.053665631
Difference at 95.0% confidence
	0.156 +/- 0.0712998
	4.2623% +/- 1.94808%
	(Student's t, pooled s = 0.0488876)

WALL time:
    N           Min           Max        Median           Avg        Stddev
x   5          4.56          4.61          4.59         4.588   0.019235384
+   5          4.68          4.74          4.72         4.714   0.024083189
Difference at 95.0% confidence
	0.126 +/- 0.031786
	2.74629% +/- 0.692808%
	(Student's t, pooled s = 0.0217945)

Use case:
LLD-MachO occasionally allocates memory from multiple threads.

Diff Detail

Event Timeline

oontvoo created this revision. Mar 14 2022, 1:31 PM
Herald added projects: Restricted Project, Restricted Project. Mar 14 2022, 1:31 PM
Herald added a reviewer: Restricted Project.
oontvoo requested review of this revision. Mar 14 2022, 1:31 PM
Herald added a project: Restricted Project. Mar 14 2022, 1:31 PM
oontvoo edited the summary of this revision. Mar 14 2022, 1:31 PM
int3 requested changes to this revision. (Edited) Mar 14 2022, 1:44 PM

This doesn't fix race conditions between StringSaver and make<>, since StringSaver calls the BumpPtrAllocator directly (without going through SpecificBumpPtrAllocator.)

I was thinking that we could fix https://github.com/llvm/llvm-project/issues/54378 for now by replacing make<TrieNode>() by new TrieNode(), i.e. using the system allocator instead of the BumpPtrAllocator. (Then ~TrieBuilder could handle the freeing.)

(Conceptually a bump allocator can be made lock-free with atomic increments, but the current implementation looked complex enough that I was discouraged from attempting it.)

Yeah, I'd taken a look at it too and it does seem like a bunch of work. I'm thinking that we could maybe (eventually) do a hybrid solution where we use mutexes for the infrequently-called new-slab-allocation codepath, and atomic CAS for the fast pointer bump.
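The hybrid idea could look roughly like the sketch below (hypothetical names; a simplification, not an in-tree design): a compare-and-swap loop for the common pointer bump, and a mutex only on the rare new-slab path. Alignment handling is omitted for brevity.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical sketch of a hybrid bump allocator: lock-free fast path,
// locked slow path for slab allocation. Slabs are never freed while the
// allocator lives, so there is no ABA hazard on Cur.
class HybridBumpAllocator {
  static constexpr size_t SlabSize = 1 << 16;
  std::atomic<uintptr_t> Cur{0};
  std::atomic<uintptr_t> End{0};
  std::mutex SlabMu;
  std::vector<std::unique_ptr<char[]>> Slabs;

public:
  void *allocate(size_t Size) {
    assert(Size <= SlabSize && "oversized allocations not handled here");
    for (;;) {
      uintptr_t P = Cur.load(std::memory_order_acquire);
      uintptr_t E = End.load(std::memory_order_acquire);
      if (P && P + Size <= E) {
        // Fast path: lock-free bump; a failed CAS means another thread
        // won the race, so just retry.
        if (Cur.compare_exchange_weak(P, P + Size,
                                      std::memory_order_acq_rel))
          return reinterpret_cast<void *>(P);
        continue;
      }
      // Slow path: take the lock, recheck, and install a fresh slab.
      std::lock_guard<std::mutex> Lock(SlabMu);
      uintptr_t Q = Cur.load(std::memory_order_acquire);
      if (!(Q && Q + Size <= End.load(std::memory_order_acquire))) {
        Slabs.push_back(std::make_unique<char[]>(SlabSize));
        uintptr_t Base = reinterpret_cast<uintptr_t>(Slabs.back().get());
        End.store(Base + SlabSize, std::memory_order_release);
        Cur.store(Base, std::memory_order_release); // publish last
      }
    }
  }
};
```

A stale Cur/End pair can pass the fast-path check, but the CAS then fails because Cur has moved to the new slab, so the thread simply retries.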

Another option is to use the system allocator whenever we need concurrent allocations, and add some cmake flag that allows us to easily switch to rpmalloc or jemalloc.

This revision now requires changes to proceed. Mar 14 2022, 1:44 PM
int3 added a comment. Mar 14 2022, 1:45 PM

The race with StringSaver is not hypothetical btw; we are using it in parallel with TrieBuilder, via SymtabSection::emitBeginSourceStab().

This doesn't fix race conditions between StringSaver and make<>, since StringSaver calls the BumpPtrAllocator directly (without going through SpecificBumpPtrAllocator.)

ok - that's an orthogonal problem :)

I was thinking that we could fix https://github.com/llvm/llvm-project/issues/54378 for now by replacing make<TrieNode>() by new TrieNode(), i.e. using the system allocator instead of the BumpPtrAllocator. (Then ~TrieBuilder could handle the freeing.)

sure, i'm fine with a quick/temp fix for this

Another option is to use the system allocator whenever we need concurrent allocations, and add some cmake flag that allows us to easily switch to rpmalloc or jemalloc.

I think the current problem is that a given piece of code doesn't know whether it should be thread-safe or not. It'd seem weird to have some parts of the linker use a thread-safe allocator while the remaining parts don't ...
Whichever allocator we end up using, I think it should be used across the whole linker ...

int3 added a comment. Mar 14 2022, 1:57 PM

ok - that's an orthogonal problem :)

I dunno if that's orthogonal; I think it's actually the crux of https://github.com/llvm/llvm-project/issues/54378. I don't believe any threads other than the TrieBuilder's and SymtabSection's allocate ATM. Using make<> from a single thread running in parallel is fine, as long as the other threads don't allocate too...

It'd seem weird to have some parts of the linker use a thread-safe allocator while the remaining parts don't ...

Mm that's a good point, consistency is clarity. But still, we could switch the entire linker to use the system allocator and see what the perf hit is. Maaaybe rpmalloc/jemalloc performs well enough that the additional BumpPtrAllocator layering isn't necessary.

MaskRay added a comment. (Edited) Mar 14 2022, 9:34 PM

(Conceptually a bump allocator can be made lock-free with atomic increments, but the current implementation looked complex enough that I was discouraged from attempting it.)

I agree it adds complexity and likely slows down single-threaded usage.
In addition, modern allocators tend to use multiple arenas instead of competing for one central arena.

The places where concurrent make<Foo>() would improve performance are not dominant. It may make sense to create a dedicated make-like function for parallelism.
For example, in lld/ELF I can use makeT<InputSection> in the parallel initialization of sections. For the other places, the current make<Foo> suffices.
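One way to read the makeT idea, sketched below with hypothetical names (the exact in-tree API may differ): give each thread its own bump arena, so parallel construction needs no locking at all. Caveats: destructors of allocated objects never run, and in a real linker the arenas would have to outlive the worker threads, which this sketch ignores.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>
#include <utility>
#include <vector>

// Hypothetical per-thread arena: a plain single-threaded bump allocator.
struct Arena {
  static constexpr size_t SlabSize = 1 << 16;
  std::vector<std::unique_ptr<char[]>> Slabs;
  char *Cur = nullptr;
  char *End = nullptr;

  void *bump(size_t Size, size_t Align) {
    uintptr_t P =
        (reinterpret_cast<uintptr_t>(Cur) + Align - 1) & ~(Align - 1);
    if (!Cur || P + Size > reinterpret_cast<uintptr_t>(End)) {
      Slabs.push_back(std::make_unique<char[]>(SlabSize));
      Cur = Slabs.back().get();
      End = Cur + SlabSize;
      P = reinterpret_cast<uintptr_t>(Cur);
    }
    Cur = reinterpret_cast<char *>(P + Size);
    return reinterpret_cast<void *>(P);
  }
};

// Hypothetical "makeT": each (thread, T) pair gets its own arena, so
// concurrent callers never touch the same Cur/End and need no locks.
template <typename T, typename... Args> T *makeT(Args &&...A) {
  thread_local Arena LocalArena;
  return new (LocalArena.bump(sizeof(T), alignof(T)))
      T(std::forward<Args>(A)...);
}
```

This mirrors the "multiple arenas" point above: contention is avoided by partitioning memory per thread rather than by synchronizing on one central allocator.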

oontvoo abandoned this revision. Apr 1 2022, 11:32 AM

abandoning in favour of D122922