MergedInputSection::getOffset is the busiest function in LLD if string
merging is enabled and input files have lots of mergeable sections.
That is usually the case when creating an executable with debug info,
so it is pretty common.
The reason why it is slow is that it has to do fairly complex
computations. For non-mergeable sections, section contents are
contiguous in the output, so to compute an output offset we only
have to add the output section's base address to an input offset.
But for mergeable strings, section contents are split up for merging,
so they are not contiguous, and we have to do some lookups.
We used to do a binary search on the list of section pieces.
It is slow, I believe, because it is hostile to branch prediction.
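The old scheme can be sketched as follows. The types and names here are a simplified, hypothetical model (the real SectionPiece in LLD carries more state); the point is only the shape of the lookup: an std::upper_bound over the sorted pieces, then an add.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified model of a merged section's pieces.
// Each piece covers [InputOff, InputOff + Size) in the input section
// and has been assigned OutputOff in the output section.
struct SectionPiece {
  uint64_t InputOff;
  uint64_t Size;
  uint64_t OutputOff;
};

// Old approach (sketch): binary-search the piece list, which is sorted
// by InputOff, for the piece containing Offset, then add the delta
// within that piece.
uint64_t getOffsetBinarySearch(const std::vector<SectionPiece> &Pieces,
                               uint64_t Offset) {
  // Find the first piece whose InputOff is greater than Offset;
  // the piece just before it contains Offset.
  auto It = std::upper_bound(
      Pieces.begin(), Pieces.end(), Offset,
      [](uint64_t Off, const SectionPiece &P) { return Off < P.InputOff; });
  assert(It != Pieces.begin() && "offset before first piece");
  const SectionPiece &P = *(It - 1);
  assert(Offset < P.InputOff + P.Size && "offset past end of piece");
  return P.OutputOff + (Offset - P.InputOff);
}
```

Each call takes O(log n) comparisons whose outcomes are essentially data-dependent, which is what makes the branches hard to predict.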
This patch replaces it with a hash table lookup, which seems to work
pretty well. Below is "perf stat -r10" output when linking clang
with debug info. In this case the patch speeds up the link by about 5%.
Before:
      6721.936004 task-clock (msec)        #  1.001 CPUs utilized            ( +-  0.11% )
              234 context-switches         #  0.035 K/sec                    ( +-  5.15% )
                0 cpu-migrations           #  0.000 K/sec                    ( +- 50.92% )
        1,075,866 page-faults              #  0.160 M/sec                    ( +-  0.10% )
   18,754,243,967 cycles                   #  2.790 GHz                      ( +-  0.11% )
   10,081,501,978 stalled-cycles-frontend  # 53.76% frontend cycles idle     ( +-  0.19% )
  <not supported> stalled-cycles-backend
   21,185,939,161 instructions             #  1.13  insns per cycle
                                           #  0.48  stalled cycles per insn  ( +-  0.02% )
    3,809,479,053 branches                 # 566.723 M/sec                   ( +-  0.01% )
      133,385,955 branch-misses            #  3.50% of all branches          ( +-  0.00% )

      6.717561895 seconds time elapsed                                       ( +-  0.11% )
After:
      6386.323956 task-clock (msec)        #  1.000 CPUs utilized            ( +-  0.23% )
              353 context-switches         #  0.055 K/sec                    ( +- 18.85% )
                0 cpu-migrations           #  0.000 K/sec                    ( +- 55.28% )
        1,289,073 page-faults              #  0.202 M/sec                    ( +-  0.11% )
   17,817,334,814 cycles                   #  2.790 GHz                      ( +-  0.23% )
   10,395,036,549 stalled-cycles-frontend  # 58.34% frontend cycles idle     ( +-  0.37% )
  <not supported> stalled-cycles-backend
   18,856,073,120 instructions             #  1.06  insns per cycle
                                           #  0.55  stalled cycles per insn  ( +-  0.06% )
    3,282,465,949 branches                 # 513.984 M/sec                   ( +-  0.06% )
       72,314,845 branch-misses            #  2.20% of all branches          ( +-  0.01% )

      6.384843079 seconds time elapsed                                       ( +-  0.24% )
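One way the hash-table scheme can be sketched: map each piece's input offset to its output offset, so the common case (a relocation pointing at the start of a string) is a single hash probe, with binary search kept only as a fallback for mid-piece offsets. The types and method names below are a hypothetical simplification, not the actual LLD code; std::unordered_map stands in for the project's own hash table.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified model of a merged section's pieces.
struct SectionPiece {
  uint64_t InputOff;   // start of the piece in the input section
  uint64_t Size;
  uint64_t OutputOff;  // where the piece landed in the output section
};

struct MergedSection {
  std::vector<SectionPiece> Pieces;                  // sorted by InputOff
  std::unordered_map<uint64_t, uint64_t> OffsetMap;  // InputOff -> OutputOff

  // Build the hash table once, after output offsets are assigned.
  void buildOffsetMap() {
    for (const SectionPiece &P : Pieces)
      OffsetMap[P.InputOff] = P.OutputOff;
  }

  uint64_t getOffset(uint64_t Offset) const {
    // Fast path: the offset points at the start of a piece.
    auto It = OffsetMap.find(Offset);
    if (It != OffsetMap.end())
      return It->second;
    // Slow path: binary search for an offset in the middle of a piece.
    auto Pc = std::upper_bound(
        Pieces.begin(), Pieces.end(), Offset,
        [](uint64_t O, const SectionPiece &P) { return O < P.InputOff; });
    assert(Pc != Pieces.begin() && "offset before first piece");
    const SectionPiece &P = *(Pc - 1);
    return P.OutputOff + (Offset - P.InputOff);
  }
};
```

The hash probe replaces a chain of data-dependent compare-and-branch steps with mostly straight-line code, which is consistent with the drop in branch-misses from 3.50% to 2.20% above.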
Have you analyzed the pattern of this Offset value? I suspect there is some sort of pattern that is easier to cache than using a whole map (e.g. the values might be sequential).
(as a first-order analysis, just add a printf here and see how well gzip compresses it)