I'm sending this patch to get feedback. I haven't convinced even myself
that this is the right thing to do. But this should be interesting
to those who want to see what we can do to improve the linker's latency.
String merging is one of the slowest passes in LLD because of the
sheer number of mergeable strings. For example, Clang with debug info
contains 30 million mergeable strings (average length is about 50
bytes). They need to be uniquified, and uniquified strings need to
get consecutive offsets in the resulting string table.
Currently, we are using a (single-threaded, regular) dense map for
string unification. Merging the 30 million strings takes about 2
seconds on my machine.
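For readers unfamiliar with the pass, the single-threaded approach can be
sketched roughly as below. This is an illustration only, not LLD's actual
code: it uses std::unordered_map instead of LLVM's dense map, and the names
MergedTable and mergeStrings are made up for this sketch.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// Hypothetical result type: the table bytes plus one offset per input.
struct MergedTable {
  std::string data;                 // NUL-terminated strings, back to back
  std::vector<std::size_t> offsets; // offsets[i] locates inputs[i] in data
};

// Sequentially uniquify `inputs`. Each distinct string is stored once, and
// distinct strings get consecutive offsets in first-seen order.
MergedTable mergeStrings(const std::vector<std::string_view> &inputs) {
  MergedTable out;
  out.offsets.reserve(inputs.size());
  // Keys view the caller's input data, which must outlive this call.
  std::unordered_map<std::string_view, std::size_t> seen;
  for (std::string_view s : inputs) {
    // If `s` is new, record the current end of the table as its offset.
    auto [it, inserted] = seen.try_emplace(s, out.data.size());
    if (inserted) {
      out.data.append(s);
      out.data.push_back('\0');
    }
    out.offsets.push_back(it->second);
  }
  return out;
}
```

Every input goes through the same hash map one after another, which is why
this step does not scale with the number of cores.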
This patch implements one of my ideas about how to reduce latency by
parallelizing string merging. This algorithm is probabilistic, meaning that
even though duplicated strings are likely to be merged, that's not
guaranteed. As a result, it quickly produces a larger string table.
(If you need to optimize in size, you could still pass -O2 which
Here's how it works.
In the first step, we take 10% of the input string set to create a small
string table. The resulting string table is very unlikely to contain
all strings of the entire set, but it is likely to contain most of
the duplicated strings, because duplicated strings are repeated many times.
The second step processes the remaining 90% in parallel. In this step,
we do not merge strings. So, if a string is not in the small string
table we created in the first step, it is just appended to the end
of the string table. This step completes the string table.
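The two steps above can be sketched as follows. This is a minimal
illustration under several assumptions that are not taken from the patch:
the sample is every sampleStride-th string, each worker appends misses to
its own buffer that is concatenated afterwards, and the names ProbTable and
mergeProbabilistic are hypothetical. The real patch may pick the sample and
assign offsets differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

struct ProbTable {
  std::string data;                 // NUL-terminated strings, back to back
  std::vector<std::size_t> offsets; // offsets[i] locates inputs[i] in data
};

ProbTable mergeProbabilistic(const std::vector<std::string_view> &inputs,
                             std::size_t sampleStride = 10,
                             unsigned numThreads = 4) {
  sampleStride = std::max<std::size_t>(sampleStride, 1);
  numThreads = std::max(numThreads, 1u);
  const std::size_t n = inputs.size();
  ProbTable out;
  out.offsets.resize(n);

  // Step 1 (sequential): uniquify the sample to build the small table.
  std::unordered_map<std::string_view, std::size_t> small;
  for (std::size_t i = 0; i < n; i += sampleStride) {
    auto [it, inserted] = small.try_emplace(inputs[i], out.data.size());
    if (inserted) {
      out.data.append(inputs[i]);
      out.data.push_back('\0');
    }
    out.offsets[i] = it->second;
  }

  // Each worker owns one contiguous shard of the input.
  const std::size_t chunk = (n + numThreads - 1) / numThreads;
  auto shard = [&](unsigned t) {
    std::size_t begin = std::min(n, static_cast<std::size_t>(t) * chunk);
    return std::pair<std::size_t, std::size_t>(begin,
                                               std::min(n, begin + chunk));
  };

  // Step 2 (parallel): `small` is read-only from here on, so concurrent
  // lookups are safe. Misses are appended without any deduplication.
  std::vector<std::string> local(numThreads);
  std::vector<unsigned char> isLocal(n, 0); // 1 if offset is shard-local
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < numThreads; ++t) {
    workers.emplace_back([&, t] {
      auto [begin, end] = shard(t);
      for (std::size_t i = begin; i < end; ++i) {
        if (i % sampleStride == 0)
          continue; // already handled in step 1
        auto it = small.find(inputs[i]);
        if (it != small.end()) {
          out.offsets[i] = it->second;
        } else {
          out.offsets[i] = local[t].size();
          isLocal[i] = 1;
          local[t].append(inputs[i]);
          local[t].push_back('\0');
        }
      }
    });
  }
  for (std::thread &w : workers)
    w.join();

  // Concatenate the per-thread buffers and rebase shard-local offsets.
  for (unsigned t = 0; t < numThreads; ++t) {
    const std::size_t base = out.data.size();
    out.data += local[t];
    auto [begin, end] = shard(t);
    for (std::size_t i = begin; i < end; ++i)
      if (isLocal[i])
        out.offsets[i] += base;
  }
  return out;
}
```

A string that is missed by the sample and occurs twice in the remaining
input is stored twice, which is where the extra table size comes from. In
exchange, the parallel step never writes to a shared hash table.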
Here are some numbers for the resulting clang executables:
  Size of .debug_str section:
    Current             108,049,822 (+0%)
    Probabilistic       154,089,550 (+42.6%)
    No string merging 1,591,388,940 (+1372.8%)

  Size of resulting file:
    Current           1,440,453,528 (+0%)
    Probabilistic     1,490,597,448 (+3.5%)
    No string merging 2,945,020,808 (+104.5%)
The probabilistic algorithm produces a larger string table, but it is
still much smaller than the one without string merging. Compared to the
entire executable size, the loss is only 3.5%.
Here is the speedup in latency:
  Before:

     36098.025468 task-clock (msec)         #    5.256 CPUs utilized            ( +-  0.95% )
          190,770 context-switches          #    0.005 M/sec                    ( +-  0.25% )
            7,609 cpu-migrations            #    0.211 K/sec                    ( +- 11.40% )
        2,378,416 page-faults               #    0.066 M/sec                    ( +-  0.07% )
   99,645,202,279 cycles                    #    2.760 GHz                      ( +-  0.94% )
   81,128,226,367 stalled-cycles-frontend   #   81.42% frontend cycles idle     ( +-  1.10% )
  <not supported> stalled-cycles-backend
   45,662,681,567 instructions              #    0.46  insns per cycle
                                            #    1.78  stalled cycles per insn  ( +-  0.14% )
    8,864,616,311 branches                  #  245.571 M/sec                    ( +-  0.22% )
      146,360,227 branch-misses             #    1.65% of all branches          ( +-  0.06% )

      6.868559257 seconds time elapsed                                          ( +-  0.50% )

  After:

     36905.733802 task-clock (msec)         #    7.061 CPUs utilized            ( +-  0.84% )
          159,813 context-switches          #    0.004 M/sec                    ( +-  0.24% )
            8,079 cpu-migrations            #    0.219 K/sec                    ( +- 12.67% )
        2,296,298 page-faults               #    0.062 M/sec                    ( +-  0.21% )
  102,178,380,224 cycles                    #    2.769 GHz                      ( +-  0.83% )
   83,846,653,367 stalled-cycles-frontend   #   82.06% frontend cycles idle     ( +-  0.96% )
  <not supported> stalled-cycles-backend
   46,138,345,206 instructions              #    0.45  insns per cycle
                                            #    1.82  stalled cycles per insn  ( +-  0.15% )
    8,824,763,690 branches                  #  239.116 M/sec                    ( +-  0.24% )
      142,482,338 branch-misses             #    1.61% of all branches          ( +-  0.05% )

      5.227024403 seconds time elapsed                                          ( +-  0.43% )
In terms of latency, this algorithm is a clear win.
With these results, I have a feeling that this algorithm could be
a reasonable addition to LLD. For only a few percent of loss in size,
it reduces latency by about 24% (from 6.87 to 5.23 seconds), so it might
be a good option for daily edit-build-test cycles. (On the other hand,
disabling string merging with -O0 creates 2x larger executables, which is
sometimes inconvenient even for a daily development cycle.) You can still
pass -O2 to produce production binaries.
I have another idea to reduce string merging latency, so I'll
implement that later for comparison.