This is an archive of the discontinued LLVM Phabricator instance.

[ELF] Move gotIndex/pltIndex/globalDynIndex to SymbolAux
ClosedPublic

Authored by MaskRay on Dec 25 2021, 11:42 AM.

Details

Summary

to decrease sizeof(SymbolUnion) by 8 on ELF64 platforms.

Symbols needing such information are typically 1% or fewer (5134 out of 560520
when linking clang, 19898 out of 5550705 when linking chrome). Storing them
elsewhere can decrease memory usage and symbol initialization time.
There is a ~0.8% saving on max RSS when linking a large program.

Future direction:

  • Move some of dynsymIndex/verdefIndex/versionId to SymbolAux
  • Support mixed TLSDESC and TLS GD without increasing sizeof(SymbolUnion)

Diff Detail

Event Timeline

MaskRay created this revision.Dec 25 2021, 11:42 AM
MaskRay requested review of this revision.Dec 25 2021, 11:42 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 25 2021, 11:42 AM

I've made some suggestions in inline comments. This type of change does tend to make the implementation slightly harder to read and more fragile, maybe worth getting some data on how much effect this has on linking performance? If it is negligble and there are no plans to follow it up with more changes that depend on it, is it worth doing?

No fundamental objections though. Would be good to hear other opinions too?

lld/ELF/Relocations.cpp
1556

Would it be better to use sym.getPltIdx() here? It is slightly less efficient, but is more consistent with the rest of the code.

2184

These two lines look like they could be replaced with sym->allocIdx()

lld/ELF/Symbols.cpp
62

To reduce the number of dynamic allocations it may be worth using a vector and reserving 1% of the total number of global symbols in one allocation, using your data from clang?

lld/ELF/Symbols.h
92

Will be worth a comment as auxIdx isn't as obvous as gotIndex, pltIndex.

lld/ELF/SyntheticSections.cpp
658–659

Using symAux.back() directly here makes me a bit nervous as the function body of the callee is a long way from the caller. I'd be more comfortable with using symAux[sym.auxIdx].

Maybe an opportunity for sym.setPltIndex(numEntries) that could abstract this away in the same way that the getPltIndex() does.

If that doesn't appeal, perhaps something like assert(sym.auxIdx == symAuxIdx.size() - 1) to make sure that this doesn't go out of synch.

MaskRay added a comment.EditedJan 4 2022, 11:30 AM

There is a ~0.8% saving on max RSS (8-byte saving on SymbolUnion). In addition, when we support mixsed TLSDESC and TLS-GD by introducing a new Symbol member, the change will remove some (estimate: 0.4%~0.8%) memory regression.

# current
% repeat 5 (=time -f "\nCPU: %Us\tReal: %es\tRAM: %MKB" /tmp/c/0 -flavor gnu --threads=4 @response.txt -o /dev/shm/a.out; 'rm' -f /dev/shm/a.out) # current
CPU: 10.26s     Real: 9.96s     RAM: 6894960KB
CPU: 10.77s     Real: 10.01s    RAM: 6896900KB
CPU: 10.62s     Real: 9.95s     RAM: 6892492KB
CPU: 10.73s     Real: 9.97s     RAM: 6892396KB
CPU: 10.86s     Real: 9.97s     RAM: 6892672KB

# with D116281
% repeat 5 (=time -f "\nCPU: %Us\tReal: %es\tRAM: %MKB" /tmp/c/1 -flavor gnu --threads=4 @response.txt -o /dev/shm/a.out; 'rm' -f /dev/shm/a.out) # current                                                                            
CPU: 10.51s     Real: 9.72s     RAM: 6837240KB
CPU: 10.68s     Real: 9.93s     RAM: 6839616KB
CPU: 10.54s     Real: 9.70s     RAM: 6839244KB
CPU: 10.57s     Real: 9.78s     RAM: 6839652KB
CPU: 10.61s     Real: 9.91s     RAM: 6838356KB

On the performance side, there is a tiny improvement:

% hyperfine --warmup 2 --min-runs 40 "numactl -C 20-23 "{/tmp/c/0,/tmp/c/1}" -flavor gnu @response.txt --threads=4 -o lld.8"
Benchmark 1: numactl -C 20-23 /tmp/c/0 -flavor gnu @response.txt --threads=4 -o lld.8
  Time (mean ± σ):      8.882 s ±  0.042 s    [User: 9.752 s, System: 2.557 s]
  Range (min … max):    8.799 s …  8.994 s    40 runs
 
Benchmark 2: numactl -C 20-23 /tmp/c/1 -flavor gnu @response.txt --threads=4 -o lld.8
  Time (mean ± σ):      8.826 s ±  0.044 s    [User: 9.685 s, System: 2.550 s]
  Range (min … max):    8.760 s …  8.940 s    40 runs
 
Summary
  'numactl -C 20-23 /tmp/c/1 -flavor gnu @response.txt --threads=4 -o lld.8' ran
    1.01 ± 0.01 times faster than 'numactl -C 20-23 /tmp/c/0 -flavor gnu @response.txt --threads=4 -o lld.8'

When linking chrome, symtab->symbols().size() == 1806181 && symAux.size() == 19898. symAux roughly takes 1%. The dynamic resize does not happen that much.

MaskRay updated this revision to Diff 397381.Jan 4 2022, 1:04 PM
MaskRay marked 4 inline comments as done.

Thanks for the comments! Addressed

lld/ELF/Symbols.cpp
62

When linking chrome, symtab->symbols().size() == 1806181 && symAux.size() == 19898. symAux roughly takes 1%.

When linking clang (release+asserts, all targets), symtab->symbols().size() == 137246 && symAux.size() == 5184.

symAux may be too small to bother with reserve.

lld/ELF/SyntheticSections.cpp
658–659

Thanks! Using assert(sym.auxIdx == symAuxIdx.size() - 1) for now.

set*Index is not appealing to me as that introduces some called-only-once functions.

MaskRay updated this revision to Diff 397383.Jan 4 2022, 1:05 PM

add an assert

peter.smith accepted this revision.Jan 5 2022, 1:12 AM

Thanks for the updates and the data. I don't have any more comments. I've marked as approved from my side, but please wait a bit to see if the other reviewers have any comments or concerns.

This revision is now accepted and ready to land.Jan 5 2022, 1:12 AM
MaskRay edited the summary of this revision. (Show Details)Jan 9 2022, 1:40 PM
This revision was automatically updated to reflect the committed changes.