This is an archive of the discontinued LLVM Phabricator instance.

Differential D121635

[lld][macho][elf] Teach the bump-allocator in lld/Common about thread-safetiness.
Abandoned · Public

Authored by oontvoo on Mar 14 2022, 1:31 PM.


Details

Reviewers

int3
MaskRay

Group Reviewers

Restricted Project

Summary

The current impl is quite simple: hold a lock briefly while allocating new memory.
(Conceptually the bump-allocator can be implemented without locks using atomic incr', but looking at the current impl, I kind of got discouraged as it seemed too complex)
Performance data (profiling chromium-framework):

x ./lld_macho_base
+ ./lld_macho_safe_alloc

SYSTEM CPU time:
    N           Min           Max        Median           Avg        Stddev
x   5          0.73          0.83          0.79          0.79   0.037416574
+   5          0.71          0.88          0.77         0.784   0.062689712
No difference proven at 95.0% confidence

USER CPU time:
    N           Min           Max        Median           Avg        Stddev
x   5          3.62          3.73          3.65          3.66   0.043588989
+   5          3.74          3.86          3.84         3.816   0.053665631
Difference at 95.0% confidence
	0.156 +/- 0.0712998
	4.2623% +/- 1.94808%
	(Student's t, pooled s = 0.0488876)

WALL time:
    N           Min           Max        Median           Avg        Stddev
x   5          4.56          4.61          4.59         4.588   0.019235384
+   5          4.68          4.74          4.72         4.714   0.024083189
Difference at 95.0% confidence
	0.126 +/- 0.031786
	2.74629% +/- 0.692808%
	(Student's t, pooled s = 0.0217945)
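
The "Difference at 95.0% confidence" lines can be reproduced from the averages and standard deviations in the tables. The following is a minimal sketch, assuming the tool (its output resembles FreeBSD's ministat) uses a pooled two-sample Student's t interval with equal sample sizes. The names `Interval` and `diffInterval` and the critical value 2.306 (two-sided 95%, 8 degrees of freedom) are illustrative assumptions, not part of this revision.

```cpp
#include <cassert>
#include <cmath>

// Result of comparing two samples of equal size n.
struct Interval {
  double diff;      // difference of the averages (new - base)
  double pooled;    // pooled standard deviation
  double halfWidth; // half-width of the confidence interval
};

// tCrit is the two-sided Student's t critical value for 2n - 2 degrees of
// freedom (assumed here: 2.306 for n = 5 at 95% confidence).
inline Interval diffInterval(double avgBase, double sdBase, double avgNew,
                             double sdNew, int n, double tCrit) {
  assert(n > 1);
  Interval r;
  r.diff = avgNew - avgBase;
  // With equal sample sizes the pooled variance is the mean of the variances.
  r.pooled = std::sqrt((sdBase * sdBase + sdNew * sdNew) / 2.0);
  r.halfWidth = tCrit * r.pooled * std::sqrt(2.0 / n);
  return r;
}
```

With the WALL time numbers above, `diffInterval(4.588, 0.019235384, 4.714, 0.024083189, 5, 2.306)` gives a difference of about 0.126, a pooled s of about 0.0218, and a half-width of about 0.0318, which agrees with the reported figures.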

Use case:
LLD-macho occasionally allocates memory in multiple threads.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

oontvoo created this revision. Mar 14 2022, 1:31 PM

Herald added projects: Restricted Project, Restricted Project. Mar 14 2022, 1:31 PM

Herald added a reviewer: Restricted Project.

oontvoo requested review of this revision. Mar 14 2022, 1:31 PM

Herald added a project: Restricted Project. Mar 14 2022, 1:31 PM

Herald added a subscriber: llvm-commits.

oontvoo edited the summary of this revision. Mar 14 2022, 1:31 PM

This doesn't fix race conditions between StringSaver and make<>, since StringSaver calls the BumpPtrAllocator directly (without going through SpecificBumpPtrAllocator.)

I was thinking that we could fix https://github.com/llvm/llvm-project/issues/54378 for now by replacing make<TrieNode>() by new TrieNode(), i.e. using the system allocator instead of the BumpPtrAllocator. (Then ~TrieBuilder could handle the freeing.)
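
A minimal sketch of that temporary fix, assuming each TrieBuilder is used by a single thread. The `TrieNode` and `TrieBuilder` types below are simplified stand-ins rather than the real lld classes, and the `live` counter and `liveNodesAfterParallelBuild` exist only so the demo can check that every node is freed.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real TrieNode; counts live instances so the
// demo below can check that everything is freed.
struct TrieNode {
  static std::atomic<int> live;
  std::vector<TrieNode *> children;
  TrieNode() { ++live; }
  ~TrieNode() { --live; }
};
std::atomic<int> TrieNode::live{0};

// Hypothetical builder: each instance is used by one thread, so its node
// list needs no locking; only the (thread-safe) system allocator is shared.
class TrieBuilder {
  std::vector<TrieNode *> nodes;

public:
  TrieBuilder() = default;
  TrieBuilder(const TrieBuilder &) = delete;
  TrieBuilder &operator=(const TrieBuilder &) = delete;
  ~TrieBuilder() {
    for (TrieNode *n : nodes)
      delete n;
  }

  // Replaces make<TrieNode>(): allocate with new and remember the pointer.
  TrieNode *makeNode() {
    nodes.push_back(new TrieNode());
    return nodes.back();
  }
};

// Builds small tries on several threads and returns the number of nodes
// still alive once all builders have been destroyed.
inline int liveNodesAfterParallelBuild(unsigned numThreads,
                                       unsigned perThread) {
  std::vector<std::thread> threads;
  for (unsigned t = 0; t < numThreads; ++t)
    threads.emplace_back([perThread] {
      TrieBuilder builder;
      TrieNode *root = builder.makeNode();
      for (unsigned i = 1; i < perThread; ++i)
        root->children.push_back(builder.makeNode());
    });
  for (std::thread &th : threads)
    th.join();
  return TrieNode::live.load();
}
```

The trade-off is that every node becomes a separate heap allocation, which gives up the locality and cheap bulk teardown of the arena in exchange for not touching the shared BumpPtrAllocator from worker threads.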

> (Conceptually the bump-allocator can be implemented without locks using atomic incr', but looking at the current impl, I kind of got discouraged as it seemed too complex)

Yeah, I'd taken a look at it too and it does seem like a bunch of work. I'm thinking that we could maybe (eventually) do a hybrid solution where we use mutexes for the infrequently-called new-slab-allocation codepath, and atomic CAS for the fast pointer bump.
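
The hybrid idea can be sketched as follows. This is an illustration under stated assumptions, not llvm::BumpPtrAllocator: `HybridBumpAllocator`, `kSlabSize`, and `distinctParallelAllocations` are hypothetical names, slabs are never freed before the allocator is destroyed, and the unused tail of an exhausted slab is simply wasted.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

constexpr std::size_t kSlabSize = 4096; // hypothetical slab size

class HybridBumpAllocator {
  struct Slab {
    std::unique_ptr<char[]> mem;
    std::uintptr_t end;
    std::atomic<std::uintptr_t> cur; // next free address in this slab
    explicit Slab(std::size_t bytes)
        : mem(new char[bytes]),
          end(reinterpret_cast<std::uintptr_t>(mem.get()) + bytes),
          cur(reinterpret_cast<std::uintptr_t>(mem.get())) {}
  };

  std::atomic<Slab *> current{nullptr};
  std::mutex slabMutex;                     // guards slab creation and `slabs`
  std::vector<std::unique_ptr<Slab>> slabs; // keeps every slab alive

public:
  void *allocate(std::size_t size, std::size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    for (;;) {
      Slab *slab = current.load(std::memory_order_acquire);
      if (slab) {
        // Fast path: bump the pointer with a CAS loop, no lock taken.
        std::uintptr_t old = slab->cur.load(std::memory_order_relaxed);
        for (;;) {
          std::uintptr_t start =
              (old + align - 1) & ~(static_cast<std::uintptr_t>(align) - 1);
          if (start > slab->end || size > slab->end - start)
            break; // slab exhausted, fall through to the slow path
          if (slab->cur.compare_exchange_weak(old, start + size,
                                              std::memory_order_relaxed))
            return reinterpret_cast<void *>(start);
          // CAS failure reloaded `old`; recompute and retry.
        }
      }
      // Slow path: only one thread installs a new slab; the others retry.
      std::lock_guard<std::mutex> lock(slabMutex);
      if (current.load(std::memory_order_relaxed) == slab) {
        std::size_t bytes = size + align > kSlabSize ? size + align : kSlabSize;
        slabs.push_back(std::make_unique<Slab>(bytes));
        current.store(slabs.back().get(), std::memory_order_release);
      }
    }
  }
};

// Allocates from several threads and returns how many distinct addresses
// were handed out.
inline std::size_t distinctParallelAllocations(unsigned numThreads,
                                               unsigned perThread) {
  HybridBumpAllocator alloc;
  std::vector<std::vector<void *>> results(numThreads);
  std::vector<std::thread> threads;
  for (unsigned t = 0; t < numThreads; ++t)
    threads.emplace_back([&alloc, &results, t, perThread] {
      for (unsigned i = 0; i < perThread; ++i)
        results[t].push_back(alloc.allocate(sizeof(long), alignof(long)));
    });
  for (std::thread &th : threads)
    th.join();
  std::set<void *> unique;
  for (const auto &r : results)
    unique.insert(r.begin(), r.end());
  return unique.size();
}
```

Because old slabs stay alive until the allocator is destroyed, a thread that still holds a stale slab pointer can only fail its bounds check and retry; it never touches freed memory.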

Another option is to use the system allocator whenever we need concurrent allocations, and add some cmake flag that allows us to easily switch to rpmalloc or jemalloc.

This revision now requires changes to proceed. Mar 14 2022, 1:44 PM

The race with StringSaver is not hypothetical btw; we are using it in parallel with TrieBuilder, via SymtabSection::emitBeginSourceStab().

In D121635#3380515, @int3 wrote:

> This doesn't fix race conditions between StringSaver and make<>, since StringSaver calls the BumpPtrAllocator directly (without going through SpecificBumpPtrAllocator.)

ok - that's an orthogonal problem :)

> I was thinking that we could fix https://github.com/llvm/llvm-project/issues/54378 for now by replacing make<TrieNode>() by new TrieNode(), i.e. using the system allocator instead of the BumpPtrAllocator. (Then ~TrieBuilder could handle the freeing.)

sure, i'm fine with a quick/temp fix for this

> Another option is to use the system allocator whenever we need concurrent allocations, and add some cmake flag that allows us to easily switch to rpmalloc or jemalloc.

I think the current problem is that a given piece of code doesn't know whether it should be thread-safe or not. It'd seem weird to have x parts of the linker use a thread-safe allocator while the y remaining parts don't ...
Whichever allocator we end up using, I think it should be used across the whole linker ...

> ok - that's an orthogonal problem :)

I dunno if that's orthogonal, I think that's actually the crux of https://github.com/llvm/llvm-project/issues/54378 -- I don't believe there are any other threads that allocate ATM, aside from the trieBuilder's and SymtabSection's. Using make<> in a single thread running in parallel is fine as long as the other threads don't allocate too...

> It'd seem weird to have x parts of the linker use a thread-safe allocator, then y remaining parts don't ...

Mm that's a good point, consistency is clarity. But still, we could switch the entire linker to use the system allocator and see what the perf hit is. Maaaybe rpmalloc/jemalloc performs well enough that the additional BumpPtrAllocator layering isn't necessary.

Harbormaster completed remote builds in B154171: Diff 415200. Mar 14 2022, 2:06 PM

> (Conceptually the bump-allocator can be implemented without locks using atomic incr', but looking at the current impl, I kind of got discouraged as it seemed too complex)

I agree it adds complexity and likely slows down single-threaded usage.
In addition, modern allocators tend to use multiple arenas instead of competing for one central arena.

The places where concurrent make<Foo>() improves performance do not dominate. It may make sense to create a dedicated make-like function for parallelism.
For example, in lld/ELF, I can use makeT<InputSection> in parallel initialization of sections. For other places the current make<Foo> suffices.
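
One way to picture a dedicated make-like function that avoids a central arena is to give each thread its own arena. The sketch below is an assumption-laden illustration, not how makeT is implemented: `makeParallel`, `ArenaRegistry`, `registry`, `localArena`, and `Section` are hypothetical names, and a `std::deque` stands in for a per-thread bump allocator because it keeps element addresses stable as it grows.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Owns one arena per thread so objects outlive the threads that made them.
template <typename T> struct ArenaRegistry {
  std::mutex mutex; // taken only when a thread registers its arena
  std::vector<std::unique_ptr<std::deque<T>>> arenas;
};

template <typename T> ArenaRegistry<T> &registry() {
  static ArenaRegistry<T> r;
  return r;
}

// Returns the calling thread's arena, creating it on first use.
template <typename T> std::deque<T> &localArena() {
  thread_local std::deque<T> *mine = [] {
    ArenaRegistry<T> &r = registry<T>();
    std::lock_guard<std::mutex> lock(r.mutex);
    r.arenas.push_back(std::make_unique<std::deque<T>>());
    return r.arenas.back().get();
  }();
  return *mine;
}

// Hypothetical parallel-safe counterpart of make<T>(): no lock on the
// allocation path because each thread appends to its own arena.
template <typename T, typename... U> T *makeParallel(U &&...args) {
  std::deque<T> &arena = localArena<T>();
  arena.emplace_back(std::forward<U>(args)...);
  return &arena.back();
}

// Hypothetical payload type for the demo.
struct Section {
  int id;
  explicit Section(int id) : id(id) {}
};

// Creates objects on several threads; returns the total held by all arenas.
inline std::size_t parallelMakeDemo(unsigned numThreads, unsigned perThread) {
  std::vector<std::thread> threads;
  for (unsigned t = 0; t < numThreads; ++t)
    threads.emplace_back([t, perThread] {
      for (unsigned i = 0; i < perThread; ++i)
        makeParallel<Section>(static_cast<int>(t * perThread + i));
    });
  for (std::thread &th : threads)
    th.join();
  ArenaRegistry<Section> &r = registry<Section>();
  std::lock_guard<std::mutex> lock(r.mutex);
  std::size_t total = 0;
  for (const auto &arena : r.arenas)
    total += arena->size();
  return total;
}
```

Single-threaded callers would keep using the existing make<Foo>() and pay nothing extra, which matches the point that only a few call sites benefit from parallel allocation.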

abandoning in favour of D122922

Herald added a subscriber: StephenFan. Apr 1 2022, 11:32 AM

Revision Contents

Path                               Size
lld/Common/Memory.cpp              2 lines
lld/MachO/Driver.cpp               1 line
lld/include/lld/Common/Memory.h    32 lines

Diff 415200

lld/Common/Memory.cpp

	//===- Memory.cpp ---------------------------------------------------------===//			//===- Memory.cpp ---------------------------------------------------------===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#include "lld/Common/Memory.h"			#include "lld/Common/Memory.h"
	#include "lld/Common/CommonLinkerContext.h"			#include "lld/Common/CommonLinkerContext.h"

	using namespace llvm;			using namespace llvm;
	using namespace lld;			using namespace lld;

				bool lld::ThreadSafeAlloc = false;

	SpecificAllocBase *			SpecificAllocBase *
	lld::SpecificAllocBase::getOrCreate(void *tag, size_t size, size_t align,			lld::SpecificAllocBase::getOrCreate(void *tag, size_t size, size_t align,
	SpecificAllocBase *(&creator)(void *)) {			SpecificAllocBase *(&creator)(void *)) {
	auto &instances = context().instances;			auto &instances = context().instances;
	auto &instance = instances[tag];			auto &instance = instances[tag];
	if (instance == nullptr) {			if (instance == nullptr) {
	void *storage = context().bAlloc.Allocate(size, align);			void *storage = context().bAlloc.Allocate(size, align);
	instance = creator(storage);			instance = creator(storage);
	}			}
	return instance;			return instance;
	}			}

lld/MachO/Driver.cpp

[... 1,101 lines not shown ...]	bool macho::link(ArrayRef<const char *> argsArr, llvm::raw_ostream &stdoutOS,
config = std::make_unique<Configuration>();		config = std::make_unique<Configuration>();
symtab = std::make_unique<SymbolTable>();		symtab = std::make_unique<SymbolTable>();
target = createTargetInfo(args);		target = createTargetInfo(args);
depTracker = std::make_unique<DependencyTracker>(		depTracker = std::make_unique<DependencyTracker>(
args.getLastArgValue(OPT_dependency_info));		args.getLastArgValue(OPT_dependency_info));
if (errorCount())		if (errorCount())
return false;		return false;

		lld::ThreadSafeAlloc = true;
if (args.hasArg(OPT_pagezero_size)) {		if (args.hasArg(OPT_pagezero_size)) {
uint64_t pagezeroSize = args::getHex(args, OPT_pagezero_size, 0);		uint64_t pagezeroSize = args::getHex(args, OPT_pagezero_size, 0);

// ld64 does something really weird. It attempts to realign the value to the		// ld64 does something really weird. It attempts to realign the value to the
// page size, but assumes the the page size is 4K. This doesn't work with		// page size, but assumes the the page size is 4K. This doesn't work with
// most of Apple's ARM64 devices, which use a page size of 16K. This means		// most of Apple's ARM64 devices, which use a page size of 16K. This means
// that it will first 4K align it by rounding down, then round up to 16K.		// that it will first 4K align it by rounding down, then round up to 16K.
// This probably only happened because no one using this arg with anything		// This probably only happened because no one using this arg with anything
[... 424 lines not shown ...]

lld/include/lld/Common/Memory.h

	[... 16 lines not shown ...]
	// Most objects are allocated using the arena allocators defined by this file.			// Most objects are allocated using the arena allocators defined by this file.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#ifndef LLD_COMMON_MEMORY_H			#ifndef LLD_COMMON_MEMORY_H
	#define LLD_COMMON_MEMORY_H			#define LLD_COMMON_MEMORY_H

	#include "llvm/Support/Allocator.h"			#include "llvm/Support/Allocator.h"
				#include <mutex>

	namespace lld {			namespace lld {

				// Set to "true" for a thread-safe (but ~2.74% slower) allocator.
				extern bool ThreadSafeAlloc;

	// A base class only used by the CommonLinkerContext to keep track of the			// A base class only used by the CommonLinkerContext to keep track of the
	// SpecificAlloc<> instances.			// SpecificAlloc<> instances.
	struct SpecificAllocBase {			struct SpecificAllocBase {
	virtual ~SpecificAllocBase() = default;			virtual ~SpecificAllocBase() = default;
	static SpecificAllocBase *getOrCreate(void *tag, size_t size, size_t align,			static SpecificAllocBase *getOrCreate(void *tag, size_t size, size_t align,
	SpecificAllocBase *(&creator)(void *));			SpecificAllocBase *(&creator)(void *));
	};			};

	// An arena of specific types T, created on-demand.			// An arena of specific types T, created on-demand.
	template <class T> struct SpecificAlloc : public SpecificAllocBase {			template <class T> struct SpecificAlloc : public SpecificAllocBase {
	static SpecificAllocBase *create(void *storage) {			static SpecificAllocBase *create(void *storage) {
	return new (storage) SpecificAlloc<T>();			return new (storage) SpecificAlloc<T>();
	}			}
	llvm::SpecificBumpPtrAllocator<T> alloc;			llvm::SpecificBumpPtrAllocator<T> alloc;
	static int tag;			static int tag;

				std::mutex mutex;
	};			};

	// The address of this static member is only used as a key in			// The address of this static member is only used as a key in
	// CommonLinkerContext::instances. Its value does not matter.			// CommonLinkerContext::instances. Its value does not matter.
	template <class T> int SpecificAlloc<T>::tag = 0;			template <class T> int SpecificAlloc<T>::tag = 0;

	// Creates the arena on-demand on the first call; or returns it, if it was			// Creates the arena on-demand on the first call; or returns it, if it was
	// already created.			// already created.
	template <typename T>			template <typename T>
	inline llvm::SpecificBumpPtrAllocator<T> &getSpecificAllocSingleton() {			inline SpecificAlloc<T> *getSpecificAllocSingletonHelper() {
	SpecificAllocBase *instance = SpecificAllocBase::getOrCreate(			SpecificAllocBase *instance = SpecificAllocBase::getOrCreate(
	&SpecificAlloc<T>::tag, sizeof(SpecificAlloc<T>),			&SpecificAlloc<T>::tag, sizeof(SpecificAlloc<T>),
	alignof(SpecificAlloc<T>), SpecificAlloc<T>::create);			alignof(SpecificAlloc<T>), SpecificAlloc<T>::create);
	return ((SpecificAlloc<T> *)instance)->alloc;			return (SpecificAlloc<T> *)instance;
				}

				template <typename T>
				inline llvm::SpecificBumpPtrAllocator<T> &getSpecificAllocSingleton() {
				return getSpecificAllocSingletonHelper<T>()->alloc;
				}

				template <typename T> inline T *DoAlloc() {
				{
				auto *allocator = getSpecificAllocSingletonHelper<T>();
				// TODO: maybe make the flag a compile-time config to avoid this branch.
				if (ThreadSafeAlloc) {
				std::lock_guard<std::mutex> lock(allocator->mutex);
				return allocator->alloc.Allocate();
				} else {
				return allocator->alloc.Allocate();
				}
				}
	}			}

	// Creates new instances of T off a (almost) contiguous arena/object pool. The			// Creates new instances of T off a (almost) contiguous arena/object pool. The
	// instances are destroyed whenever lldMain() goes out of scope.			// instances are destroyed whenever lldMain() goes out of scope.
	template <typename T, typename... U> T *make(U &&... args) {			template <typename T, typename... U> T *make(U &&... args) {
	return new (getSpecificAllocSingleton<T>().Allocate())			return new (DoAlloc<T>()) T(std::forward<U>(args)...);
	T(std::forward<U>(args)...);
	}			}

	} // namespace lld			} // namespace lld

	#endif			#endif