This is an archive of the discontinued LLVM Phabricator instance.

[LLD] Alternate implementation for "Support per-thread allocators and StringSavers"
Needs ReviewPublic

Authored by aganea on Apr 15 2022, 3:40 PM.

Details

Summary

This is an alternate implementation based on @oontvoo's D122922.

The main differences are:

  • LLD_THREAD_SAFE_MEMORY was removed.
  • AllocContext is always thread-local.
  • llvm::sys::ThreadLocal is used to make TLS allocation dynamic at runtime. This accommodates several instances of CommonLinkerContext running concurrently.
  • There are no "safe" or "perThread" functions; the APIs remain the same as before.

I did not see any performance divergence when using a two-stage LLD built with -DLLVM_INTEGRATED_CRT_ALLOC=rpmalloc, with ThinLTO and -march=native.
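For context, a second-stage configuration along these lines would enable the rpmalloc-backed CRT allocator mentioned above (paths are illustrative; LLVM_INTEGRATED_CRT_ALLOC expects the path to an rpmalloc checkout and, per the LLVM CMake docs, requires the static multithreaded CRT):

```shell
cmake -G Ninja ../llvm \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS="clang;lld" \
  -DLLVM_INTEGRATED_CRT_ALLOC=D:/git/rpmalloc \
  -DCMAKE_MSVC_RUNTIME_LIBRARY=MultiThreaded \
  -DLLVM_ENABLE_LTO=Thin \
  -DCMAKE_CXX_FLAGS="-march=native"
```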

Chromium's chrome.dll:

D:\git\chromium\src\out\Default>hyperfine "d:\git\llvm-project\stage2_rpmalloc\bin\_globalbump\lld-link.exe @__link_chrome_dll.rsp" "d:\git\llvm-project\stage2_rpmalloc\bin\_tlbump\lld-link.exe @__link_chrome_dll.rsp"
Benchmark 1: d:\git\llvm-project\stage2_rpmalloc\bin\_globalbump\lld-link.exe @__link_chrome_dll.rsp
  Time (mean ± σ):     10.971 s ±  0.037 s    [User: 0.001 s, System: 0.001 s]
  Range (min … max):   10.913 s … 11.044 s    10 runs

Benchmark 2: d:\git\llvm-project\stage2_rpmalloc\bin\_tlbump\lld-link.exe @__link_chrome_dll.rsp
  Time (mean ± σ):     10.974 s ±  0.050 s    [User: 0.000 s, System: 0.001 s]
  Range (min … max):   10.908 s … 11.072 s    10 runs

Summary
  'd:\git\llvm-project\stage2_rpmalloc\bin\_globalbump\lld-link.exe @__link_chrome_dll.rsp' ran
    1.00 ± 0.01 times faster than 'd:\git\llvm-project\stage2_rpmalloc\bin\_tlbump\lld-link.exe @__link_chrome_dll.rsp'

Chromium's unit_tests.exe:

D:\git\chromium\src\out\Default>hyperfine "d:\git\llvm-project\stage2_rpmalloc\bin\_globalbump\lld-link.exe @__link_unit_tests.rsp" "d:\git\llvm-project\stage2_rpmalloc\bin\_tlbump\lld-link.exe @__link_unit_tests.rsp"
Benchmark 1: d:\git\llvm-project\stage2_rpmalloc\bin\_globalbump\lld-link.exe @__link_unit_tests.rsp
  Time (mean ± σ):     17.512 s ±  0.197 s    [User: 0.001 s, System: 0.001 s]
  Range (min … max):   17.311 s … 17.933 s    10 runs

Benchmark 2: d:\git\llvm-project\stage2_rpmalloc\bin\_tlbump\lld-link.exe @__link_unit_tests.rsp
  Time (mean ± σ):     17.509 s ±  0.080 s    [User: 0.001 s, System: 0.003 s]
  Range (min … max):   17.387 s … 17.658 s    10 runs

Summary
  'd:\git\llvm-project\stage2_rpmalloc\bin\_tlbump\lld-link.exe @__link_unit_tests.rsp' ran
    1.00 ± 0.01 times faster than 'd:\git\llvm-project\stage2_rpmalloc\bin\_globalbump\lld-link.exe @__link_unit_tests.rsp'

Diff Detail

Event Timeline

aganea created this revision.Apr 15 2022, 3:40 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 15 2022, 3:40 PM
aganea requested review of this revision.Apr 15 2022, 3:40 PM

This is just for demonstrative purposes, @oontvoo feel free to cherry-pick anything of interest into the other patch.

oontvoo added inline comments.Apr 18 2022, 6:33 AM
lld/include/lld/Common/CommonLinkerContext.h
54

The reason I reverted the revision that used ThreadLocal was that @MaskRay observed a ~1% regression when linking Chromium.
(similarly for preserving the option of using non-thread-local make()/saver())

While I agree with you that the code would be much simpler (as in this patch) if all ports unconditionally used these new thread-safe allocators, I'm not sure we should do that at the cost of performance regressions. For Mach-O, I didn't see any difference either way, but ELF seemed to get slower.

aganea added inline comments.Apr 18 2022, 6:42 AM
lld/include/lld/Common/CommonLinkerContext.h
54

@oontvoo I agree with you: out of the box, this patch can cause perf regressions in existing code. The COFF driver doesn't perform that many allocations in multi-threaded code, which is why I don't see differences. The regressions are solvable in my view, though. @MaskRay, are you able to pinpoint which user code the 1% regression is coming from? Can we retrieve the SpecificAlloc<> at a higher level in those cases, and then call SpecificAlloc<>.make(), to avoid fetching the TLS address on every make() call?

aganea added a comment.EditedApr 18 2022, 7:12 AM

I see no difference on the ELF side, but @MaskRay maybe you're running a more thorough test? This is a two-stage LLD; the second stage uses -march=native -Xclang -O3 -fstrict-aliasing -fwhole-program-vtables -flto=thin, running on a c5a.24xlarge EC2 instance.

ubuntu@XXX:~/chromium/src/out/Default$ uname -a
Linux XXX 5.13.0-1021-aws #23~20.04.2-Ubuntu SMP Thu Mar 31 11:36:15 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
ubuntu@XXX:~/chromium/src/out/Default$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   48 bits physical, 48 bits virtual
CPU(s):                          96
On-line CPU(s) list:             0-95
Thread(s) per core:              2
Core(s) per socket:              48
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           49
Model name:                      AMD EPYC 7R32
Stepping:                        0
CPU MHz:                         2799.992
BogoMIPS:                        5599.98
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       1.5 MiB
L1i cache:                       1.5 MiB
L2 cache:                        24 MiB
L3 cache:                        192 MiB
NUMA node0 CPU(s):               0-95
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; LFENCE, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
                                 fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni
                                 pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy
                                 cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2
                                 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt
                                 nrip_save rdpid

(bin_tls_bumpptr is with this patch, bin_no_tls is without, using Chromium checkout 8660f8deda4982de98baddafaffb651898144bac and LLVM checkout b859c39c40a79ff74033f67d807a18130b9afe30)

ubuntu@XXX:~/chromium/src/out/Default$ hyperfine '~/llvm-project/stage2/bin_tls_bumpptr/ld.lld @__link_unit_tests.rsp' '~/llvm-project/stage2/bin_no_tls/ld.lld @__link_unit_tests.rsp'
Benchmark 1: ~/llvm-project/stage2/bin_tls_bumpptr/ld.lld @__link_unit_tests.rsp
  Time (mean ± σ):      9.743 s ±  0.057 s    [User: 21.273 s, System: 78.149 s]
  Range (min … max):    9.661 s …  9.845 s    10 runs

Benchmark 2: ~/llvm-project/stage2/bin_no_tls/ld.lld @__link_unit_tests.rsp
  Time (mean ± σ):      9.770 s ±  0.046 s    [User: 21.241 s, System: 78.736 s]
  Range (min … max):    9.695 s …  9.844 s    10 runs

Summary
  '~/llvm-project/stage2/bin_tls_bumpptr/ld.lld @__link_unit_tests.rsp' ran
    1.00 ± 0.01 times faster than '~/llvm-project/stage2/bin_no_tls/ld.lld @__link_unit_tests.rsp'

ubuntu@XXX:~/chromium/src/out/Default$ hyperfine '~/llvm-project/stage2/bin_tls_bumpptr/ld.lld @__link_chrome.rsp' '~/llvm-project/stage2/bin_no_tls/ld.lld @__link_chrome.rsp'
Benchmark 1: ~/llvm-project/stage2/bin_tls_bumpptr/ld.lld @__link_chrome.rsp
  Time (mean ± σ):      6.277 s ±  0.077 s    [User: 13.156 s, System: 50.551 s]
  Range (min … max):    6.218 s …  6.484 s    10 runs

Benchmark 2: ~/llvm-project/stage2/bin_no_tls/ld.lld @__link_chrome.rsp
  Time (mean ± σ):      6.275 s ±  0.025 s    [User: 13.274 s, System: 50.396 s]
  Range (min … max):    6.233 s …  6.314 s    10 runs

Summary
  '~/llvm-project/stage2/bin_no_tls/ld.lld @__link_chrome.rsp' ran
    1.00 ± 0.01 times faster than '~/llvm-project/stage2/bin_tls_bumpptr/ld.lld @__link_chrome.rsp'