This is an archive of the discontinued LLVM Phabricator instance.

Merge strings using a probabilistic algorithm to reduce latency.
Abandoned · Public

Authored by ruiu on Nov 26 2016, 9:11 PM.

Details

Reviewers
silvas
rafael
Summary

I'm sending this patch to get feedback. I haven't convinced even myself
that this is the right thing to do, but it should be interesting to
those who want to see what we can do to improve the linker's latency.

String merging is one of the slowest passes in LLD because of the
sheer number of mergeable strings. For example, Clang with debug info
contains 30 million mergeable strings (with an average length of about
50 bytes). They need to be uniquified, and the uniquified strings need
to be assigned consecutive offsets in the resulting string table.

Currently, we are using a (single-threaded, regular) dense map for
string unification. Merging the 30 million strings takes about 2
seconds on my machine.
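
For reference, the current scheme is conceptually a single map insertion
per string. Below is a minimal sketch of that single-threaded unification,
using std::unordered_map and made-up names rather than LLD's actual
DenseMap-based StringTableBuilder:

  #include <cstdint>
  #include <string>
  #include <unordered_map>
  #include <vector>

  // Single-threaded string unification: every string gets exactly one
  // copy in Data, and duplicates reuse the offset recorded in Offsets.
  struct StringTable {
    std::unordered_map<std::string, uint64_t> Offsets;
    std::vector<char> Data;

    uint64_t add(const std::string &S) {
      auto It = Offsets.find(S);
      if (It != Offsets.end())
        return It->second;                // duplicate: reuse existing copy
      uint64_t Off = Data.size();
      Data.insert(Data.end(), S.begin(), S.end());
      Data.push_back('\0');               // strings are NUL-terminated
      Offsets.emplace(S, Off);
      return Off;
    }
  };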

This patch implements one of my ideas for reducing latency by
parallelizing this pass. The algorithm is probabilistic, meaning that
duplicated strings are likely to be merged but are not guaranteed to be.
As a result, it quickly produces a somewhat larger string table.
(If you need to optimize for size, you can still pass -O2, which does
tail merging.)

Here's how it works.

In the first step, we take 10% of the input string set and use it to
create a small string table. The resulting table is very unlikely to
contain every string of the entire set, but it is likely to contain most
of the duplicated strings, because duplicated strings are repeated many
times.

The second step processes the remaining 90% in parallel. In this step,
we do not merge strings: if a string is not in the small string table
created in the first step, it is simply appended to the end of the
string table. This step completes the string table.

Here are some numbers from the resulting clang executables:

Size of .debug_str section:
Current            108,049,822   (+0%)
Probabilistic      154,089,550   (+42.6%)
No string merging  1,591,388,940 (+1472.8%)

Size of resulting file:
Current            1,440,453,528 (+0%)
Probabilistic      1,490,597,448 (+3.5%)
No string merging  2,945,020,808 (+204.5%)

The probabilistic algorithm produces a larger string table, but it is
still much smaller than the table produced without string merging.
Relative to the entire executable size, the loss is only 3.5%.

Here is the speedup in latency:

Before:

   36098.025468 task-clock (msec)         #    5.256 CPUs utilized            ( +-  0.95% )
        190,770 context-switches          #    0.005 M/sec                    ( +-  0.25% )
          7,609 cpu-migrations            #    0.211 K/sec                    ( +- 11.40% )
      2,378,416 page-faults               #    0.066 M/sec                    ( +-  0.07% )
 99,645,202,279 cycles                    #    2.760 GHz                      ( +-  0.94% )
 81,128,226,367 stalled-cycles-frontend   #   81.42% frontend cycles idle     ( +-  1.10% )
<not supported> stalled-cycles-backend
 45,662,681,567 instructions              #    0.46  insns per cycle
                                          #    1.78  stalled cycles per insn  ( +-  0.14% )
  8,864,616,311 branches                  #  245.571 M/sec                    ( +-  0.22% )
    146,360,227 branch-misses             #    1.65% of all branches          ( +-  0.06% )

    6.868559257 seconds time elapsed                                          ( +-  0.50% )

After:

   36905.733802 task-clock (msec)         #    7.061 CPUs utilized            ( +-  0.84% )
        159,813 context-switches          #    0.004 M/sec                    ( +-  0.24% )
          8,079 cpu-migrations            #    0.219 K/sec                    ( +- 12.67% )
      2,296,298 page-faults               #    0.062 M/sec                    ( +-  0.21% )
102,178,380,224 cycles                    #    2.769 GHz                      ( +-  0.83% )
 83,846,653,367 stalled-cycles-frontend   #   82.06% frontend cycles idle     ( +-  0.96% )
<not supported> stalled-cycles-backend
 46,138,345,206 instructions              #    0.45  insns per cycle
                                          #    1.82  stalled cycles per insn  ( +-  0.15% )
  8,824,763,690 branches                  #  239.116 M/sec                    ( +-  0.24% )
    142,482,338 branch-misses             #    1.61% of all branches          ( +-  0.05% )

    5.227024403 seconds time elapsed                                          ( +-  0.43% )

In terms of latency, this algorithm is a clear win.

With these results, I have a feeling that this algorithm could be a
reasonable addition to LLD. For only a few percent loss in size, it
reduces latency by about 25%, so it might be a good option for daily
edit-build-test cycles. (On the other hand, disabling string merging
with -O0 creates 2x larger executables, which is sometimes inconvenient
even for a daily development cycle.) You can still pass -O2 to produce
production binaries.

I have another idea to reduce string merging latency, so I'll
implement that later for comparison.

Event Timeline

ruiu updated this revision to Diff 79345.Nov 26 2016, 9:11 PM
ruiu retitled this revision from to Merge strings using a probabilistic algorithm to reduce latency..
ruiu updated this object.
ruiu added reviewers: rafael, silvas.
ruiu added a subscriber: llvm-commits.
silvas edited edge metadata.Nov 26 2016, 10:01 PM

This is a really neat idea! Great observation that the string deduplication doesn't have to be completely perfect.

One question: how much possible speedup is there from optimizing this code path, and how much of that does the current patch get?

My biggest concern with this approach is how to get deterministic output from it.

As it stands, the current patch is very similar to having StringTableBuilder only use StringIndexMap.insert some fraction of the time, and the rest of the time just using find and appending the string to the table if it isn't found.

ELF/OutputSections.cpp
633

This will result in non-deterministic output :/

One other question: how much of this speedup is due to the parallelizing, and how much due to the improved algorithm?

asl added a subscriber: asl.Nov 27 2016, 1:53 AM

Hi Rui,

Why can't you use a two-bit counting Bloom filter to efficiently determine whether a given string needs to be merged or not? Filling the Bloom filter could be done in parallel, and on the second pass you would insert a string into the map only if it is known to have a count > 1.

If you need code for a counting BF that can be filled in parallel, let me know.
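
For concreteness, a two-bit counting Bloom filter along these lines
might look like the sketch below. This is my own illustration, not code
attached to this review; it stores one counter per byte for simplicity,
whereas a real version would pack four two-bit counters per byte:

  #include <algorithm>
  #include <atomic>
  #include <cstddef>
  #include <cstdint>
  #include <string>
  #include <vector>

  // Saturating 2-bit counters addressed by two hashes of each string.
  // add() can be called from many threads; probablyDuplicated() reports
  // whether a string was (very likely) added more than once. False
  // positives are possible, false negatives are not.
  class CountingBloomFilter {
    std::vector<std::atomic<uint8_t>> Counters;
    size_t Mask;

    static uint64_t hash(const std::string &S, uint64_t Seed) {
      uint64_t H = Seed ^ 0xcbf29ce484222325ULL; // FNV-1a-style mixing
      for (unsigned char C : S)
        H = (H ^ C) * 0x100000001b3ULL;
      return H;
    }

  public:
    // SizePow2 must be a power of two.
    explicit CountingBloomFilter(size_t SizePow2)
        : Counters(SizePow2), Mask(SizePow2 - 1) {}

    void add(const std::string &S) {
      for (uint64_t Seed : {1, 2}) {
        std::atomic<uint8_t> &C = Counters[hash(S, Seed) & Mask];
        uint8_t Old = C.load(std::memory_order_relaxed);
        while (Old < 3 && !C.compare_exchange_weak(Old, uint8_t(Old + 1),
                                                   std::memory_order_relaxed))
          ; // retry until incremented or already saturated
      }
    }

    bool probablyDuplicated(const std::string &S) const {
      uint8_t Min = 3;
      for (uint64_t Seed : {1, 2})
        Min = std::min<uint8_t>(Min, Counters[hash(S, Seed) & Mask].load());
      return Min >= 2;
    }
  };

With something like this, the first pass would call add() on every
string in parallel, and the second pass would insert a string into the
shared map only when probablyDuplicated() returns true, appending it
directly otherwise.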

One other approach (which might even complement this one) is that we might be able to come up with a fast heuristic for determining whether a string is likely unique or likely not unique. I assume there is some sort of pattern; e.g. file paths are probably duplicated a lot in debug info. We could compute such heuristics nearly for free at the same time as we hash the strings.

With such a heuristic, we can bypass the hash table lookup altogether.

I looked more closely at the data (an LLD self-link, which includes LLVM for LTO, so it is reasonably large). I don't think there will be an easy static heuristic. The bulk of the strings are symbol names, and the duplication factor is just a complicated function of how the input program is structured (e.g. whether a particular class is a common utility used in many places, etc.).

Also, one interesting thing is that the total size is not dominated by strings with any particular number of duplicates; it is spread out fairly evenly across many different duplication counts.

Here is a plot. The horizontal axis is the number of duplicates. The vertical axis is the number of bytes represented by strings with that number of duplicates. Area under the curve is proportional to total string size for the non-deduplicated output.
https://reviews.llvm.org/F2622328

Another useful thing to look at is that same plot, but "integrated": https://reviews.llvm.org/F2622335

This makes it easier to see why the approach in this patch works well. About 80% of the string mass is present in strings that are duplicated more than 50 times.
There are approximately 750 files in the link. So, assuming random sampling, if we look at 10% of them (so, 75), then of the strings with more than 50 repetitions we will miss approximately (700 / 750) * (699 / 749) * (698 / 748) * ... * (625 / 675) ~= 2% of them; so this approach will save almost all of that 80% string mass.
(This is just to get intuition; of course, some strings that are replicated fewer than 50 times will be deduplicated too, but that only increases the savings.)
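
Spelling out that estimate (my restatement, not part of the original
comment): when sampling n = 75 of N = 750 files without replacement, a
string present in d = 50 files is missed only if none of those d files
is sampled, which is the hypergeometric zero-hit probability that the
product above approximates:

  \[
  P(\text{miss}) = \frac{\binom{N-d}{n}}{\binom{N}{n}}
                 = \prod_{i=0}^{n-1} \frac{N-d-i}{N-i}
                 = \frac{700}{750}\cdot\frac{699}{749}\cdots
  \]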

asl added a comment.Nov 27 2016, 6:19 AM

Sean, effectively a counting Bloom filter would give an almost exact answer as to whether a string is a singleton or duplicated. This way no hash table lookups / insertions are even necessary.

ruiu added a comment.Nov 27 2016, 8:57 AM

String merging shrinks the size of the string table from 1591 MB to 108 MB, so we already know that a string is a duplicate with 93% probability. A Bloom filter could improve that to, say, 99%, but would it make a difference?
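
(My restatement of where the 93% figure comes from, by bytes, using the
.debug_str sizes above:)

  \[
  1 - \frac{108\,\text{MB}}{1591\,\text{MB}} \approx 0.93
  \]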

ruiu added a comment.Nov 27 2016, 9:19 AM

Sean, thank you for the investigation and the numbers! I knew some of the numbers (that's why I could come up with this), but I just implemented it, set the threshold to 10%, and it just worked, so I didn't take a close look at each section's contents.

One thing I'd like to note in addition to your results is that strings are already uniquified within each input section (for an obvious reason: compilers don't emit redundant strings). So the likelihood of identifying duplicate strings when picking random samples of input sections is higher than when picking random samples from the 30 million strings. Instead of a single bag containing 30 million strings, we have approximately 1,000 bags containing 30 million strings in total, and inside each bag all strings are unique. That's why I construct the small string table from random samples of input sections rather than random samples of section pieces.

One other question: how much of this speedup is due to the parallelizing, and how much due to the improved algorithm?

The speedup is entirely due to the parallelization. If you run this with a single thread, you get the same latency as before. And that's a good property, because the algorithm doesn't slow down the linker when available CPU resources are scarce.

ELF/OutputSections.cpp
633

Hmm, that's true, and I have no good idea for making it deterministic at the moment. We could partially serialize it, but that would slow down this part of the process.

In D27146#606203, @ruiu wrote:

Sean, thank you for the investigation and the numbers! I knew some of the numbers (that's why I could come up with this), but I just implemented it, set the threshold to 10%, and it just worked, so I didn't take a close look at each section's contents.

One thing I'd like to note in addition to your results is that strings are already uniquified within each input section (for an obvious reason: compilers don't emit redundant strings). So the likelihood of identifying duplicate strings when picking random samples of input sections is higher than when picking random samples from the 30 million strings. Instead of a single bag containing 30 million strings, we have approximately 1,000 bags containing 30 million strings in total, and inside each bag all strings are unique. That's why I construct the small string table from random samples of input sections rather than random samples of section pieces.

Nice observation!

One other question: how much of this speedup is due to the parallelizing, and how much due to the improved algorithm?

The speedup is entirely due to the parallelization. If you run this with a single thread, you get the same latency as before. And that's a good property, because the algorithm doesn't slow down the linker when available CPU resources are scarce.

An approach like this could also improve overall CPU time by reducing the amount of time spent in hash table lookups. Smaller hash tables are faster to look up, so we can bail out faster for strings without duplicates.
Also, the observation that false negatives don't hurt can be used to speed up the hash table itself (but that would be quite a bit of work). I think your observation that the string merging for -O1 can accept false negatives when detecting duplicates is the key idea. There are lots of different optimizations that can be built on it.

I was going to suggest using modified Bloom filters that detect duplicates, but Anton beat me to it! (I didn't know such a thing already had a name: "counting Bloom filters".)
If we already have hash values for each string, then doing the Bloom filter lookup is pretty cheap, and it is easy to combine the per-input filters.
This talk is a good example of Bloom filters applied to a similar problem (except that "input object files" are "documents" and "strings" are "terms"): https://www.youtube.com/watch?v=80LKF2qph6I
The good thing is that the final Bloom filter can be constructed hierarchically in parallel. I'm not sure how useful that would be for this application, though, since for string deduplication "most" strings are duplicates, so we need to fetch the offset anyway, and there is not much advantage in trying to filter out the cases where we *don't* need to do the offset lookup.
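
A generic illustration of the hierarchical construction (not tied to
this patch): plain Bloom filters built independently over disjoint
subsets of the inputs can be merged with a bitwise OR, so per-thread
filters can be combined pairwise into one final filter. For the counting
variant, the merge would be a per-counter saturating add instead.

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Merge two equally sized Bloom-filter bit arrays built from disjoint
  // inputs; the result answers queries as if every input had been added
  // to a single filter.
  std::vector<uint64_t> mergeBloom(const std::vector<uint64_t> &A,
                                   const std::vector<uint64_t> &B) {
    std::vector<uint64_t> Out(A.size());
    for (size_t I = 0; I < A.size(); ++I)
      Out[I] = A[I] | B[I];
    return Out;
  }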

ruiu abandoned this revision.Dec 7 2016, 9:53 AM

Rafael,

Sorry for not marking this as "abandoned" earlier. I wrote three patches approaching this problem in different ways, and https://reviews.llvm.org/D27152 is the one that I want to check in. Can you take a look if you are interested?