This is an archive of the discontinued LLVM Phabricator instance.

[LLD] Alternate implementation for "Support per-thread allocators and StringSavers"
Needs ReviewPublic

Authored by aganea on Apr 15 2022, 3:40 PM.

Details

Summary

This is an alternate implementation based on @oontvoo's D122922.

The main differences are:

  • LLD_THREAD_SAFE_MEMORY was removed.
  • AllocContext is always thread-local.
  • llvm::sys::ThreadLocal is used to make TLS allocation dynamic at runtime. This accommodates several instances of CommonLinkerContext running concurrently.
  • There are no "safe" or "perThread" functions; the APIs remain the same as before.

I did not see any performance divergence when using a two-stage LLD built with -DLLVM_INTEGRATED_CRT_ALLOC=rpmalloc, with ThinLTO and -march=native.
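For context, a second-stage configuration along these lines would enable the rpmalloc-backed CRT allocator mentioned above (paths are illustrative; LLVM_INTEGRATED_CRT_ALLOC expects the path to an rpmalloc checkout and, per the LLVM CMake docs, requires the static multithreaded CRT):

```shell
cmake -G Ninja ../llvm \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS="clang;lld" \
  -DLLVM_INTEGRATED_CRT_ALLOC=D:/git/rpmalloc \
  -DCMAKE_MSVC_RUNTIME_LIBRARY=MultiThreaded \
  -DLLVM_ENABLE_LTO=Thin \
  -DCMAKE_CXX_FLAGS="-march=native"
```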

Chromium's chrome.dll:

D:\git\chromium\src\out\Default>hyperfine "d:\git\llvm-project\stage2_rpmalloc\bin\_globalbump\lld-link.exe @__link_chrome_dll.rsp" "d:\git\llvm-project\stage2_rpmalloc\bin\_tlbump\lld-link.exe @__link_chrome_dll.rsp"
Benchmark 1: d:\git\llvm-project\stage2_rpmalloc\bin\_globalbump\lld-link.exe @__link_chrome_dll.rsp
  Time (mean ± σ):     10.971 s ±  0.037 s    [User: 0.001 s, System: 0.001 s]
  Range (min … max):   10.913 s … 11.044 s    10 runs

Benchmark 2: d:\git\llvm-project\stage2_rpmalloc\bin\_tlbump\lld-link.exe @__link_chrome_dll.rsp
  Time (mean ± σ):     10.974 s ±  0.050 s    [User: 0.000 s, System: 0.001 s]
  Range (min … max):   10.908 s … 11.072 s    10 runs

Summary
  'd:\git\llvm-project\stage2_rpmalloc\bin\_globalbump\lld-link.exe @__link_chrome_dll.rsp' ran
    1.00 ± 0.01 times faster than 'd:\git\llvm-project\stage2_rpmalloc\bin\_tlbump\lld-link.exe @__link_chrome_dll.rsp'

Chromium's unit_tests.exe:

D:\git\chromium\src\out\Default>hyperfine "d:\git\llvm-project\stage2_rpmalloc\bin\_globalbump\lld-link.exe @__link_unit_tests.rsp" "d:\git\llvm-project\stage2_rpmalloc\bin\_tlbump\lld-link.exe @__link_unit_tests.rsp"
Benchmark 1: d:\git\llvm-project\stage2_rpmalloc\bin\_globalbump\lld-link.exe @__link_unit_tests.rsp
  Time (mean ± σ):     17.512 s ±  0.197 s    [User: 0.001 s, System: 0.001 s]
  Range (min … max):   17.311 s … 17.933 s    10 runs

Benchmark 2: d:\git\llvm-project\stage2_rpmalloc\bin\_tlbump\lld-link.exe @__link_unit_tests.rsp
  Time (mean ± σ):     17.509 s ±  0.080 s    [User: 0.001 s, System: 0.003 s]
  Range (min … max):   17.387 s … 17.658 s    10 runs

Summary
  'd:\git\llvm-project\stage2_rpmalloc\bin\_tlbump\lld-link.exe @__link_unit_tests.rsp' ran
    1.00 ± 0.01 times faster than 'd:\git\llvm-project\stage2_rpmalloc\bin\_globalbump\lld-link.exe @__link_unit_tests.rsp'

Diff Detail

Event Timeline

aganea created this revision.Apr 15 2022, 3:40 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 15 2022, 3:40 PM
aganea requested review of this revision.Apr 15 2022, 3:40 PM

This is just for demonstrative purposes, @oontvoo feel free to cherry-pick anything of interest into the other patch.

oontvoo added inline comments.Apr 18 2022, 6:33 AM
lld/include/lld/Common/CommonLinkerContext.h
54

The reason I reverted the revision that used ThreadLocal was that @MaskRay observed a ~1% regression when linking Chromium.
(similarly for preserving the option of using non-thread-local make()/saver())

While I agree with you that the code would be much simpler (as in this patch) if all ports unconditionally used these new thread-safe allocators, I'm not sure we should do that at the cost of performance regressions. For Mach-O, I didn't see any difference either way, but ELF seemed to get slower.

aganea added inline comments.Apr 18 2022, 6:42 AM
lld/include/lld/Common/CommonLinkerContext.h
54

@oontvoo I agree with you: out of the box, this patch can cause perf regressions in existing code. The COFF driver doesn't perform that many allocations in multi-threaded code, which is why I don't see differences. The regressions are solvable in my view, though. @MaskRay, are you able to pinpoint which user code the 1% regression is coming from? Can we retrieve the SpecificAlloc<> at a higher level in those cases, and then call SpecificAlloc<>.make(), to avoid fetching the TLS address on every make() call?

aganea added a comment.EditedApr 18 2022, 7:12 AM

I see no difference on the ELF side, but @MaskRay maybe you're running a more thorough test? This is a two-stage LLD; the second stage uses -march=native -Xclang -O3 -fstrict-aliasing -fwhole-program-vtables -flto=thin, running on a c5a.24xlarge EC2 instance.

ubuntu@XXX:~/chromium/src/out/Default$ uname -a
Linux XXX 5.13.0-1021-aws #23~20.04.2-Ubuntu SMP Thu Mar 31 11:36:15 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
ubuntu@XXX:~/chromium/src/out/Default$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   48 bits physical, 48 bits virtual
CPU(s):                          96
On-line CPU(s) list:             0-95
Thread(s) per core:              2
Core(s) per socket:              48
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           49
Model name:                      AMD EPYC 7R32
Stepping:                        0
CPU MHz:                         2799.992
BogoMIPS:                        5599.98
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       1.5 MiB
L1i cache:                       1.5 MiB
L2 cache:                        24 MiB
L3 cache:                        192 MiB
NUMA node0 CPU(s):               0-95
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; LFENCE, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
                                 fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni
                                 pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy
                                 cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2
                                 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt
                                 nrip_save rdpid

(bin_tls_bumpptr is with this patch, bin_no_tls is without, using Chromium checkout 8660f8deda4982de98baddafaffb651898144bac and LLVM checkout b859c39c40a79ff74033f67d807a18130b9afe30)

ubuntu@XXX:~/chromium/src/out/Default$ hyperfine '~/llvm-project/stage2/bin_tls_bumpptr/ld.lld @__link_unit_tests.rsp' '~/llvm-project/stage2/bin_no_tls/ld.lld @__link_unit_tests.rsp'
Benchmark 1: ~/llvm-project/stage2/bin_tls_bumpptr/ld.lld @__link_unit_tests.rsp
  Time (mean ± σ):      9.743 s ±  0.057 s    [User: 21.273 s, System: 78.149 s]
  Range (min … max):    9.661 s …  9.845 s    10 runs

Benchmark 2: ~/llvm-project/stage2/bin_no_tls/ld.lld @__link_unit_tests.rsp
  Time (mean ± σ):      9.770 s ±  0.046 s    [User: 21.241 s, System: 78.736 s]
  Range (min … max):    9.695 s …  9.844 s    10 runs

Summary
  '~/llvm-project/stage2/bin_tls_bumpptr/ld.lld @__link_unit_tests.rsp' ran
    1.00 ± 0.01 times faster than '~/llvm-project/stage2/bin_no_tls/ld.lld @__link_unit_tests.rsp'

ubuntu@XXX:~/chromium/src/out/Default$ hyperfine '~/llvm-project/stage2/bin_tls_bumpptr/ld.lld @__link_chrome.rsp' '~/llvm-project/stage2/bin_no_tls/ld.lld @__link_chrome.rsp'
Benchmark 1: ~/llvm-project/stage2/bin_tls_bumpptr/ld.lld @__link_chrome.rsp
  Time (mean ± σ):      6.277 s ±  0.077 s    [User: 13.156 s, System: 50.551 s]
  Range (min … max):    6.218 s …  6.484 s    10 runs

Benchmark 2: ~/llvm-project/stage2/bin_no_tls/ld.lld @__link_chrome.rsp
  Time (mean ± σ):      6.275 s ±  0.025 s    [User: 13.274 s, System: 50.396 s]
  Range (min … max):    6.233 s …  6.314 s    10 runs

Summary
  '~/llvm-project/stage2/bin_no_tls/ld.lld @__link_chrome.rsp' ran
    1.00 ± 0.01 times faster than '~/llvm-project/stage2/bin_tls_bumpptr/ld.lld @__link_chrome.rsp'