This is an alternate implementation based on @oontvoo's D122922.
The main differences are:
- LLD_THREAD_SAFE_MEMORY was removed
- AllocContext is always thread-local.
- llvm::sys::ThreadLocal is used so that TLS slots are allocated dynamically at runtime. This accommodates several instances of CommonLinkerContext running concurrently.
- There are no "safe" or "perThread" functions; the APIs remain the same as before.
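To illustrate the shape of the design listed above, here is a minimal sketch in standard C++ only. It is not the code from this patch: `BumpArena`, `LinkerContext`, `arena()` and the two-argument `make()` are hypothetical names, and a mutex-protected map keyed by thread id stands in for the dynamically allocated TLS slot that llvm::sys::ThreadLocal would provide. The point is only that each context hands every thread its own arena, so several contexts can coexist and callers keep a single `make()` entry point.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <new>
#include <thread>
#include <type_traits>
#include <unordered_map>
#include <utility>
#include <vector>

constexpr std::size_t kSlabSize = 4096;

// Hypothetical bump allocator: hands out memory from fixed-size slabs and
// frees everything at once when the arena is destroyed.
class BumpArena {
public:
  void *allocate(std::size_t size) {
    constexpr std::size_t align = alignof(std::max_align_t);
    std::size_t offset = (used + align - 1) & ~(align - 1);
    if (slabs.empty() || offset + size > kSlabSize) {
      // Start a new slab; oversized requests get a slab of their own.
      slabs.push_back(
          std::make_unique<unsigned char[]>(std::max(size, kSlabSize)));
      offset = 0;
    }
    used = offset + size;
    return slabs.back().get() + offset;
  }

private:
  std::vector<std::unique_ptr<unsigned char[]>> slabs;
  std::size_t used = 0;
};

// Hypothetical stand-in for a linker context. Assumption: a locked map keyed
// by thread id replaces the dynamic TLS slot, purely to stay in standard C++.
class LinkerContext {
public:
  BumpArena &arena() {
    std::lock_guard<std::mutex> lock(mu);
    std::unique_ptr<BumpArena> &slot = arenas[std::this_thread::get_id()];
    if (!slot)
      slot = std::make_unique<BumpArena>();
    return *slot;
  }

private:
  std::mutex mu;
  std::unordered_map<std::thread::id, std::unique_ptr<BumpArena>> arenas;
};

// One entry point for all callers; the per-thread arena is chosen internally.
// This sketch never runs destructors, so it only accepts trivially
// destructible types.
template <typename T, typename... Args>
T *make(LinkerContext &ctx, Args &&...args) {
  static_assert(std::is_trivially_destructible<T>::value,
                "sketch does not run destructors");
  return new (ctx.arena().allocate(sizeof(T))) T(std::forward<Args>(args)...);
}

// The same thread always gets the same arena; another thread gets its own.
bool arenasArePerThread() {
  LinkerContext ctx;
  BumpArena *mainArena = &ctx.arena();
  BumpArena *workerArena = nullptr;
  std::thread worker([&] { workerArena = &ctx.arena(); });
  worker.join();
  return mainArena == &ctx.arena() && workerArena != nullptr &&
         workerArena != mainArena;
}

// Two contexts living at the same time never share an arena.
bool contextsAreIndependent() {
  LinkerContext a, b;
  return &a.arena() != &b.arena();
}
```

The lock in `arena()` is a simplification of this sketch; a real TLS slot avoids it, which matters when measuring the kind of overhead discussed in this review.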
I did not see any divergence in performance when using a two-stage LLD, built with -DLLVM_INTEGRATED_CRT_ALLOC=rpmalloc, with ThinLTO & -march=native.
Chromium's chrome.dll:
D:\git\chromium\src\out\Default>hyperfine "d:\git\llvm-project\stage2_rpmalloc\bin\_globalbump\lld-link.exe @__link_chrome_dll.rsp" "d:\git\llvm-project\stage2_rpmalloc\bin\_tlbump\lld-link.exe @__link_chrome_dll.rsp"
Benchmark 1: d:\git\llvm-project\stage2_rpmalloc\bin\_globalbump\lld-link.exe @__link_chrome_dll.rsp
  Time (mean ± σ):     10.971 s ±  0.037 s    [User: 0.001 s, System: 0.001 s]
  Range (min … max):   10.913 s … 11.044 s    10 runs

Benchmark 2: d:\git\llvm-project\stage2_rpmalloc\bin\_tlbump\lld-link.exe @__link_chrome_dll.rsp
  Time (mean ± σ):     10.974 s ±  0.050 s    [User: 0.000 s, System: 0.001 s]
  Range (min … max):   10.908 s … 11.072 s    10 runs

Summary
  'd:\git\llvm-project\stage2_rpmalloc\bin\_globalbump\lld-link.exe @__link_chrome_dll.rsp' ran
    1.00 ± 0.01 times faster than 'd:\git\llvm-project\stage2_rpmalloc\bin\_tlbump\lld-link.exe @__link_chrome_dll.rsp'
Chromium's unit_tests.exe:
D:\git\chromium\src\out\Default>hyperfine "d:\git\llvm-project\stage2_rpmalloc\bin\_globalbump\lld-link.exe @__link_unit_tests.rsp" "d:\git\llvm-project\stage2_rpmalloc\bin\_tlbump\lld-link.exe @__link_unit_tests.rsp"
Benchmark 1: d:\git\llvm-project\stage2_rpmalloc\bin\_globalbump\lld-link.exe @__link_unit_tests.rsp
  Time (mean ± σ):     17.512 s ±  0.197 s    [User: 0.001 s, System: 0.001 s]
  Range (min … max):   17.311 s … 17.933 s    10 runs

Benchmark 2: d:\git\llvm-project\stage2_rpmalloc\bin\_tlbump\lld-link.exe @__link_unit_tests.rsp
  Time (mean ± σ):     17.509 s ±  0.080 s    [User: 0.001 s, System: 0.003 s]
  Range (min … max):   17.387 s … 17.658 s    10 runs

Summary
  'd:\git\llvm-project\stage2_rpmalloc\bin\_tlbump\lld-link.exe @__link_unit_tests.rsp' ran
    1.00 ± 0.01 times faster than 'd:\git\llvm-project\stage2_rpmalloc\bin\_globalbump\lld-link.exe @__link_unit_tests.rsp'
I reverted the revision that used ThreadLocal because @MaskRay observed a ~1% regression when linking Chromium.
(The same reason applies to keeping the option of using the non-thread-local make()/saver().)
While I agree that the code would be much simpler (as in this patch) if all ports unconditionally used the new thread-safe allocators, I'm not sure we should do that at the cost of performance regressions. For Mach-O, I didn't see any difference either way, but ELF seemed to get slower.