This is an archive of the discontinued LLVM Phabricator instance.

Avoid doing binary search.
ClosedPublic

Authored by ruiu on May 25 2016, 2:11 PM.

Details

Summary

MergeInputSection::getOffset is the busiest function in LLD if string
merging is enabled and the input files have lots of mergeable sections.
That is usually the case when creating an executable with debug info,
so it is pretty common.

The reason it is slow is that it has to do fairly complex
computations. For non-mergeable sections, section contents are
contiguous in the output, so to compute an output offset,
we only have to add the output section's base address to an input
offset. But for mergeable strings, section contents are split into
pieces for merging, so they are not contiguous, and we have to do some lookups.

We used to do a binary search on the list of section pieces.
It is slow, I think, because it is hostile to branch prediction.

This patch replaces the binary search with a hash table lookup, and it
seems to work pretty well. Below is "perf stat -r10" output when linking
clang with debug info. In this case the patch speeds up the link by about 5%.
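
As a rough illustration of the two lookup strategies (a simplified sketch, not the actual LLD code; SectionPiece, InputOff, and OutputOff are stand-in names here, and the map-building step corresponds to the OffsetMap/finalizePieces work mentioned later in this review):

    #include <algorithm>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Simplified stand-in for a piece of a mergeable input section.
    struct SectionPiece {
      uint64_t InputOff;  // offset of the piece in the input section
      uint64_t OutputOff; // offset of the piece in the output section
    };

    struct MergeSection {
      std::vector<SectionPiece> Pieces;                 // sorted by InputOff
      std::unordered_map<uint64_t, uint64_t> OffsetMap; // InputOff -> OutputOff

      // Old approach: binary search for the piece containing Offset.
      // Assumes Offset >= Pieces[0].InputOff (normally 0).
      uint64_t getOffsetBinarySearch(uint64_t Offset) const {
        auto I = std::upper_bound(
            Pieces.begin(), Pieces.end(), Offset,
            [](uint64_t O, const SectionPiece &P) { return O < P.InputOff; });
        const SectionPiece &P = *--I;
        return P.OutputOff + (Offset - P.InputOff);
      }

      // New approach: build a hash table once, then look offsets up directly.
      void buildOffsetMap() {
        for (const SectionPiece &P : Pieces)
          OffsetMap[P.InputOff] = P.OutputOff;
      }
      uint64_t getOffsetHash(uint64_t Offset) const {
        auto It = OffsetMap.find(Offset);
        if (It != OffsetMap.end())
          return It->second;
        // Offsets not at a piece boundary still need the search.
        return getOffsetBinarySearch(Offset);
      }
    };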

Before:

    6721.936004 task-clock (msec)         #    1.001 CPUs utilized            ( +-  0.11% )
            234 context-switches          #    0.035 K/sec                    ( +-  5.15% )
              0 cpu-migrations            #    0.000 K/sec                    ( +- 50.92% )
      1,075,866 page-faults               #    0.160 M/sec                    ( +-  0.10% )
 18,754,243,967 cycles                    #    2.790 GHz                      ( +-  0.11% )
 10,081,501,978 stalled-cycles-frontend   #   53.76% frontend cycles idle     ( +-  0.19% )
<not supported> stalled-cycles-backend
 21,185,939,161 instructions              #    1.13  insns per cycle
                                          #    0.48  stalled cycles per insn  ( +-  0.02% )
  3,809,479,053 branches                  #  566.723 M/sec                    ( +-  0.01% )
    133,385,955 branch-misses             #    3.50% of all branches          ( +-  0.00% )

    6.717561895 seconds time elapsed                                          ( +-  0.11% )

After:

    6386.323956 task-clock (msec)         #    1.000 CPUs utilized            ( +-  0.23% )
            353 context-switches          #    0.055 K/sec                    ( +- 18.85% )
              0 cpu-migrations            #    0.000 K/sec                    ( +- 55.28% )
      1,289,073 page-faults               #    0.202 M/sec                    ( +-  0.11% )
 17,817,334,814 cycles                    #    2.790 GHz                      ( +-  0.23% )
 10,395,036,549 stalled-cycles-frontend   #   58.34% frontend cycles idle     ( +-  0.37% )
<not supported> stalled-cycles-backend
 18,856,073,120 instructions              #    1.06  insns per cycle
                                          #    0.55  stalled cycles per insn  ( +-  0.06% )
  3,282,465,949 branches                  #  513.984 M/sec                    ( +-  0.06% )
     72,314,845 branch-misses             #    2.20% of all branches          ( +-  0.01% )

    6.384843079 seconds time elapsed                                          ( +-  0.24% )

Diff Detail

Repository
rL LLVM

Event Timeline

ruiu updated this revision to Diff 58510.May 25 2016, 2:11 PM
ruiu retitled this revision from to Avoid doing binary search..
ruiu updated this object.
ruiu added reviewers: rafael, silvas.
ruiu added a subscriber: llvm-commits.
ruiu updated this revision to Diff 58514.May 25 2016, 2:27 PM
  • Redo the benchmark with "taskset -c 0" to pin the process to one core and get more accurate results.
ruiu updated this object.May 25 2016, 2:29 PM
silvas requested changes to this revision.May 25 2016, 3:05 PM
silvas edited edge metadata.

IMO this is not worth doing. It's only a 5% win when this function and the surrounding code are something like 25% of the link time, so this doesn't really speed things up much at all.

Also, just giving up on this performance benefit in multithreaded mode seems like a poor engineering decision.

ELF/InputSection.cpp
533 ↗(On Diff #58514)

Have you analyzed the pattern of this Offset value? I suspect there is some sort of pattern that is easier to cache than using a whole map (e.g. the values might be sequential).

(as a first-order analysis, just add a printf here and see how well gzip compresses it)
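
A hypothetical sketch of that instrumentation (not part of the patch):

    // Hypothetical instrumentation dropped into getOffset: one line per
    // call, then e.g. "gzip -c offsets.txt | wc -c" to see how
    // compressible the offset stream is (needs <cstdio>).
    fprintf(stderr, "%p %llu\n", (void *)this, (unsigned long long)Offset);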

540 ↗(On Diff #58514)

Did you actually measure this? The actual tradeoffs for binary search vs a hash table are fairly complex and mostly about reducing serial dependencies through memory.

http://www.pvk.ca/Blog/2012/07/03/binary-search-star-eliminates-star-branch-mispredictions/
http://www.pvk.ca/Blog/2012/07/30/binary-search-is-a-pathological-case-for-caches/
http://arxiv.org/abs/1509.05053
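
One technique discussed in those links is replacing the hard-to-predict comparison branch with a conditional move; a generic sketch of that idea (not LLD code):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Returns the index of the last element <= Key. Assumes V is sorted
    // and V[0] <= Key. The ternary typically compiles to a conditional
    // move, so there is no hard-to-predict branch on the comparison.
    size_t branchFreeSearch(const std::vector<uint64_t> &V, uint64_t Key) {
      const uint64_t *Base = V.data();
      size_t N = V.size();
      while (N > 1) {
        size_t Half = N / 2;
        Base = (Base[Half] <= Key) ? Base + Half : Base;
        N -= Half;
      }
      return Base - V.data();
    }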

547 ↗(On Diff #58514)

This is sort of counterproductive...

This revision now requires changes to proceed.May 25 2016, 3:05 PM
ruiu added a comment.May 25 2016, 3:15 PM

I was surprised you said 5% overall improvement isn't worth it. The number seems really significant to me. It is not a microbenchmark but is a real result of linking clang. Just +41 lines improves the overall link time by 5%? To me, it is between "excellent" and "it's too good to be true so I'll apply this and verify it myself."

You can see the drop of number of branch mispredictions in the perf numbers.

In D20645#440099, @ruiu wrote:

I was surprised you said 5% overall improvement isn't worth it. The number seems really significant to me. It is not a microbenchmark

It is. It is basically a microbenchmark of an ad-hoc std::upper_bound-based associative container. This patch adds a hash table in front and brings this down to only 20% of the total link time instead of 25% on the one data set that you looked at. It's a band-aid and adds complexity that makes it more difficult to actually optimize the linker long-term.

Also, it punts on threads, so it doesn't stack with any gains we get there (hence not a long-term solution).

Have you tried just completely eliminating this code? E.g. creating separate "input sections" for each of the elements in the mergeable input section? Think about it like this: this code is currently 25% of the link time. Are there other solutions that are less than 25% of the link time? I think it is reasonable to expect to find a solution that only costs us 5%.

but is a real result of linking clang. Just +41 lines improves the overall link time by 5%? To me, it is between "excellent" and "it's too good to be true so I'll apply this and verify it myself."

You can see the drop of number of branch mispredictions in the perf numbers.

This kind of thinking will get you stuck in a local optimum. Think of the future and how much faster LLD will be than the current LLD. Is this patch part of that future? How much faster do you want LLD to be? How much faster can it be?

Note: on windows by default (and also trivially on tmpfs; relevant to google's internal build system) LLD does not have to wait on writing to a disk. LLD has a long way to go until it is even heavily bottlenecked by disk IO; let alone being limited by DRAM bandwidth (e.g. 50GB/s on a typical low-end machine).

I tried out this patch on a build of one of our games with debug info (output binary is around 300M; getOffset is called around 11.5M times), and I can't reproduce the significant performance difference you are seeing with this patch: http://reviews.llvm.org/F1986081 (vertical axis is run time, horizontal axis is run number)


This was on a mac. I think the extra memory cost of the OffsetMap might be hurting (and it may be worse on Windows, where the virtual memory system is usually even poorer). OffsetMap adds 2 * sizeof(uintX_t) * #pieces * (1/hashTableLoad) total memory, which for ELF64 is over 16 bytes per piece; for some string tables this densemap may be on the order of the size of the entire section.
On the profile, I see MergeInputSection::getOffset go from 20.5% to 10.4% with this patch (good), but I now see a new entry for MergeInputSection::finalizePieces at 10.3% of total link time, counteracting that. MergeInputSection::finalizePieces breaks down as 4.3% (of total link time) in DenseMap's InsertIntoBucketImpl, 3.4% in DenseMap::grow, and 2.6% self-time in finalizePieces itself.

As a point of comparison, consider this simple patch: http://reviews.llvm.org/P4309
{P4309}
Compared to the control and your current patch (blue and yellow), this patch (green) gives a pretty substantial speedup on this test case with this compiler (apple clang) on this os (mac): http://reviews.llvm.org/F1986091


Zoomed in view, normalized to median of "a": http://reviews.llvm.org/F1986092

Just the CachedLastPieceIndex + 1 case gets about 5% speedup on the test case I was looking at. Adding the CachedLastPieceIndex case brings it up to about 8%.
Obviously this approach is thread-hostile and probably not worth implementing, but hopefully it shows a couple things:

  1. When so much time is spent on so few lines of code, even relatively trivial changes can cause large performance swings.
  2. Performance improvements on one test case/platform/compiler (I mean the compiler used to build LLD itself) may be neutral (or even pessimizations) on other test cases/platforms/compilers.

To be sure I reran those measurements again, this time with the calls to the 3 different versions interleaved (reference, this patch ("rui"), and the silly alternative I posted ("alt")) which confirms the results: http://reviews.llvm.org/F1986202
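
The P4309 patch itself is not reproduced in this archive, so here is a rough sketch of the caching idea described above, with hypothetical names (and, as noted, it is thread-hostile because the cached index is shared mutable state):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct SectionPiece {
      uint64_t InputOff;
      uint64_t OutputOff;
    };

    struct MergeSection {
      std::vector<SectionPiece> Pieces; // sorted by InputOff, non-empty
      mutable size_t CachedLastPieceIndex = 0;

      bool pieceContains(size_t I, uint64_t Offset) const {
        uint64_t End =
            I + 1 < Pieces.size() ? Pieces[I + 1].InputOff : UINT64_MAX;
        return Offset >= Pieces[I].InputOff && Offset < End;
      }

      // Assumes Offset >= Pieces[0].InputOff (normally 0).
      uint64_t getOffset(uint64_t Offset) const {
        // ~50% of lookups hit the piece right after the previous one,
        // ~7% hit the same piece again; only search when neither applies.
        if (CachedLastPieceIndex + 1 < Pieces.size() &&
            pieceContains(CachedLastPieceIndex + 1, Offset)) {
          ++CachedLastPieceIndex;
        } else if (!pieceContains(CachedLastPieceIndex, Offset)) {
          auto I = std::upper_bound(
              Pieces.begin(), Pieces.end(), Offset,
              [](uint64_t O, const SectionPiece &P) { return O < P.InputOff; });
          CachedLastPieceIndex = (I - Pieces.begin()) - 1;
        }
        const SectionPiece &P = Pieces[CachedLastPieceIndex];
        return P.OutputOff + (Offset - P.InputOff);
      }
    };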

Since I have a profiler open again (this time OSX Instruments), here are some more notes that might be useful:

For this testcase, about 200k of the 11.5M calls to getOffset are for fixed-size (non-SHF_STRINGS) sections, where we can index directly instead of doing a binary search. If we wanted, we could add a special-case check to speed those up (but I don't think that would have a huge effect on overall performance for this kind of test case, so I didn't bother trying).
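
A minimal sketch of what that special case could look like, assuming the section's fixed entry size (sh_entsize) is available (hypothetical, not from the patch):

    #include <cstdint>
    #include <vector>

    struct SectionPiece { uint64_t InputOff, OutputOff; };

    // For fixed-size entries, piece K starts at input offset K * EntSize,
    // so the containing piece can be computed directly instead of searched.
    uint64_t getOffsetFixedSize(const std::vector<SectionPiece> &Pieces,
                                uint64_t EntSize, uint64_t Offset) {
      return Pieces[Offset / EntSize].OutputOff + (Offset % EntSize);
    }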

Also, one thing to note: if we already have a sorted list of relocations referencing the merge section, then resolving their offsets to piece indexes can be done almost for free in InputSection.cpp:splitStrings, since we are linearly scanning the section there anyway. InputSection.cpp:splitStrings is actually 11% of the profile, so avoiding it, optimizing it, or at least doing other useful work while we are in it (e.g. hashing all the strings or resolving offsets to pieces) is probably worth it. About 4.4% of total runtime seems to be in std::vector's __emplace_back_slow_path as called by InputSection.cpp:splitStrings, suggesting that we may want to look into tuning our memory allocation.
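
A sketch of that idea, assuming the relocation offsets into the merge section are available and sorted before splitting (hypothetical code, not from LLD):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // While splitting section data into NUL-terminated strings (which
    // already scans it linearly), consume a sorted list of relocation
    // offsets and record which piece each one lands in, in O(n) total.
    void splitAndResolve(const std::vector<uint8_t> &Data,
                         const std::vector<uint64_t> &SortedRelocOffsets,
                         std::vector<size_t> &PieceIndexForReloc) {
      PieceIndexForReloc.resize(SortedRelocOffsets.size());
      size_t RelocIdx = 0;
      size_t PieceIdx = 0;
      uint64_t Begin = 0;
      while (Begin < Data.size()) {
        uint64_t End = Begin;
        while (End < Data.size() && Data[End] != 0)
          ++End;
        ++End; // include the NUL terminator
        // Both lists increase monotonically, so each relocation is
        // visited exactly once.
        while (RelocIdx < SortedRelocOffsets.size() &&
               SortedRelocOffsets[RelocIdx] < End) {
          PieceIndexForReloc[RelocIdx] = PieceIdx;
          ++RelocIdx;
        }
        Begin = End;
        ++PieceIdx;
      }
    }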

We seem to spend about 15% of the link as self-time in InputSectionBase<ELFT>::relocateNonAlloc which can probably be dramatically reduced by manually pipelining the loop (or adding some prefetches to mitigate the serial dependencies through memory there). Also hoisting out some hot cases from the various switches and appropriate __builtin_expect will likely help a bit too.
There is about 5% self-time in SymbolBody::getVA (it is about 25% of the total time, but 20% of that is MergeInputSection::getOffset), which may be the serial dependence through memory in the expression SC->OutSec->getVA(); there also might be some dispatching overhead through some switches, and it could probably benefit from __builtin_expect or similar (I haven't looked in detail).

We spend about 12% in memcpy/memmove when writing to the output, likely bottlenecked on the kernel virtual memory subsystem (besides servicing the page fault, the kernel has to zero-fill every page we touch there; the kernel also has locking contention while servicing page faults, which explains the poor scalability with threading). This is 332ms, which for a 300MB output works out to 0.9GB/s, at least an order of magnitude below DRAM bandwidth (which memcpy/memmove should easily max out), so by tuning that (e.g. ensuring we are using 2MB pages when available, spawning threads early on to start faulting it all in, maybe using pwrite instead of mmap, etc.) that cost can be reduced to almost nothing. Note that in an -O0 link the output is 421MB with 418ms in memcpy/memmove, which is about 1GB/s (i.e. not maxing out DRAM); but in the -O0 link about 30% of time is spent in memcpy/memmove for this testcase, which means the expected speedup at -O0 from tuning this is dramatic.
Mac's VM subsystem is pretty bad though; I expect Linux to do better here (e.g. automatically use 2MB pages, etc.). Worth measuring.
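
As one concrete example of the 2MB-page suggestion above, on Linux a mapped output buffer can be marked as a candidate for transparent huge pages (a sketch under the assumption that the output is written through mmap; whether this actually helps would need measuring, as noted above):

    #include <cstddef>
    #include <sys/mman.h>

    // Hint that transparent huge pages are desirable for an mmap'ed
    // output buffer (Linux-only; compiled out elsewhere). Fewer, larger
    // pages means fewer zero-fill page faults while copying section
    // contents into the output.
    static void adviseHugePages(void *Buf, size_t Size) {
    #ifdef MADV_HUGEPAGE
      madvise(Buf, Size, MADV_HUGEPAGE);
    #else
      (void)Buf;
      (void)Size;
    #endif
    }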

Overall, merge sections seem to account for about 50% of the total runtime in this test case. I see on the profile:
MergeInputSection::getOffset - 20.5%
MergeOutputSection::addSection - 16.8%
MergeInputSection::splitIntoPieces - 11.1%
At -O0, we manage to recover most of this of course (we are about 40% faster on this test case based on a quick measurement).

One other interesting thing I measured (by just adding a printf): about 35% of the calls to MergeInputSection::getOffset are duplicated. I.e. 35% of all calls to getOffset on a particular MergeInputSection are for offsets that have already had getOffset called on them. (this is easy to test by printing a line for each call to MergeInputSection::getOffset with the pair ((uintptr_t)this,Offset) and then sort|uniq them to see how many are actually distinct).

The idea for caching the last piece index came from a quick observation of the pattern of offsets passed to MergeInputSection::getOffset for each MergeInputSection. The access pattern is pretty structured (and somewhat pretty; this only shows the pattern for 100 MergeInputSection's (out of over 7000)): http://reviews.llvm.org/F1986332


Drilling down, I found that the difference between successive piece indexes was 1 (i.e. sequential) in 50% of cases, and 0 in 7% of cases: http://reviews.llvm.org/F1986337.

Besides those, there is a very long tail and adding more cases (besides CachedLastPieceIndex + 1 and CachedLastPieceIndex) didn't speed things up in my empirical testing. Also, using CachedLastPieceIndex as an initial "seed" to partition the binary search in the case of "cache miss" didn't help much in my measurements (in fact, it actually hurt; likely because it harmed locality (i.e. highly different sequences of elements were searched each time instead of always starting at the beginning and end of the pieces array, testing the same middle every time, etc.)).

Also, at least for this test case, there is an interesting distribution of calls to MergeInputSection::getOffset made per input section: http://reviews.llvm.org/F1986325


The area under the histogram is the total number of calls to MergeInputSection::getOffset. The horizontal axis is the number of calls to MergeInputSection::getOffset for a given MergeInputSection (e.g. there is an outlier where we called MergeInputSection::getOffset 30,000 times on a single MergeInputSection). The vertical axis is the total number of calls to MergeInputSection::getOffset within each bucket of the histogram.
The thing that surprised me is that sections with a smaller number of calls to getOffset become increasingly abundant, so the graph is relatively flat (varying only 2-3x) up to about 15,000 on the horizontal axis.

rafael accepted this revision.May 26 2016, 5:29 AM
rafael edited edge metadata.

This looks good to me. If anyone can make it even faster, awesome: send a patch. For now we keep the fastest version that has been implemented.

The results I got comparing two lto builds were

chromium

master 4.360323783
patch  4.387683184 1.00627462601

chromium fast

master 1.677643336
patch  1.71745014 1.023727811

the gold plugin

master 0.369605994
patch  0.3585041 0.969962895136

clang

master 0.630896046
patch  0.608380056 0.964311093495

llvm-as

master 0.035529869
patch  0.034659865 0.975513447573

the gold plugin fsds

master 0.390681335
patch  0.379471413 0.971306737753

clang fsds

master 0.702253693
patch  0.678229592 0.965789996921

llvm-as fsds

master 0.031787751
patch  0.031102319 0.978437228856

scylla

master 4.194291385
patch  3.377171089 0.80518275413
ruiu added a subscriber: ruiu.May 26 2016, 11:51 AM

Thank you for doing this. These numbers are interesting, although some
numbers seem to be different from what I saw. At least the number I saw for
this patch is positive.

Anyway, even though there might be room to improve it further, this patch should be fine. It doesn't hurt other improvements. If there's a better way, we can simply do it (that would probably mean replacing the code in this patch, but that's fine). I don't see any reason why this change is particularly controversial. I'd say that the patch before this one (which shrinks the size of SectionPiece) would be more controversial, as it is a microscopic optimization.

ruiu added a comment.May 26 2016, 12:22 PM

Not directly related to this patch, and I'm not suggesting we should do it, but here's one radical idea that could make mergeable section handling a lot faster: instead of storing mergeable strings in sections, make them symbols and treat the symbol names as the symbol contents. For example, for a mergeable string "foobar", the compiler creates a symbol ".merge.foobar". When the linker sees a symbol that starts with the ".merge." prefix (or has a special symbol type), it creates the section contents by copying the symbol name into the section, just as it reserves space in .bss for common symbols, and sets the symbol's address to the address in the section.

The symbol table is the single central repository for identifying and uniquifying pieces of data, and mergeable strings can piggyback on it. I think this way would be a lot faster than the current scheme of merging strings using mergeable sections.
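
A hypothetical sketch of the linker-side handling in that scheme (the ".merge." prefix and all names here just illustrate the idea; nothing like this exists in LLD):

    #include <cstdint>
    #include <string>
    #include <string_view>
    #include <unordered_map>

    // The symbol name carries the string contents, so the symbol table
    // (here, a small side table) deduplicates identical strings for free.
    struct MergedStringSection {
      std::string Data;                                  // output contents
      std::unordered_map<std::string, uint64_t> Offsets; // contents -> offset

      uint64_t add(std::string_view S) {
        auto [It, Inserted] = Offsets.try_emplace(std::string(S), Data.size());
        if (Inserted) {
          Data += S;   // first occurrence: copy the bytes into the section
          Data += '\0';
        }
        return It->second; // duplicates reuse the existing copy
      }
    };

    // For a symbol like ".merge.foobar" (prefix already checked by the
    // caller), strip the prefix, allocate "foobar" in the section, and
    // point the symbol at it.
    uint64_t resolveMergeSymbol(MergedStringSection &Sec, uint64_t SecBaseAddr,
                                std::string_view SymName) {
      constexpr std::string_view Prefix = ".merge.";
      return SecBaseAddr + Sec.add(SymName.substr(Prefix.size()));
    }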

In D20645#441351, @ruiu wrote:

Not directly related to this patch, and I'm not suggesting we should do it, but here's one radical idea that could make mergeable section handling a lot faster: instead of storing mergeable strings in sections, make them symbols and treat the symbol names as the symbol contents. For example, for a mergeable string "foobar", the compiler creates a symbol ".merge.foobar". When the linker sees a symbol that starts with the ".merge." prefix (or has a special symbol type), it creates the section contents by copying the symbol name into the section, just as it reserves space in .bss for common symbols, and sets the symbol's address to the address in the section.

Yes, this is exactly what I mentioned above about creating separate "input sections" for each mergeable object. I was thinking about doing it at the section level because that seems more natural for ELF (the compiler already does this to some extent in -ffunction-sections -fdata-sections), but I guess it might make sense to do it at symbol level as a sort of "atom model".

silvas accepted this revision.May 26 2016, 9:29 PM
silvas edited edge metadata.

These are the timings on windows: http://reviews.llvm.org/F1987812

Looks like this patch (labeled "rui") gives some speedup. I'm still a bit worried about the extra memory usage (on the principle of not being a memory hog), but since this doesn't seem to pessimize anything, it seems fine.

A couple nits inline, but this LGTM.

ELF/InputSection.cpp
539 ↗(On Diff #58514)

These branches may be less predictable than typical branches, but I think it is premature (based on what I've seen presented as analysis) to specifically say that the slowdown is due to branch misprediction.

For example, the fact that changing the sizeof in r270717 made a significant performance difference suggests that there is a strong influence of memory access too and that branch misprediction isn't the only factor.

(see e.g. the things I linked earlier; the performance of binary search is actually quite complex and, broadly speaking, the performance difference between binary search and hash tables is due to memory, not branch prediction).

So I would prefer to remove this comment.

548 ↗(On Diff #58514)

I'd really like to avoid this if (!Config->Threads) thing. For example, this is a confounding factor if somebody wants to analyze LLD's scalability with threads, so I'd rather we not have it unless it makes a huge difference (which IIRC you said it didn't).

This revision is now accepted and ready to land.May 26 2016, 9:29 PM
This revision was automatically updated to reflect the committed changes.