
Merge strings using sharded hash tables.

Authored by ruiu on Nov 27 2016, 1:12 PM.



This is another attempt to speed up string merging. You may want to
read the description of the previous attempt (D27146) first.

In this patch, I took a different approach than the probabilistic
algorithm used in D27146. Here is the algorithm.

The original code has a single hash table to merge strings. Now we
have N hash tables, where N is the degree of parallelism (currently N=16).

We invoke N threads. Each thread knows its thread index I where
0 <= I < N. For each string S in a given string set, thread I adds S
to its own hash table only if hash(S) % N == I.

When all threads are done, there are N string tables with all
duplicated strings being merged.
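The shard-by-hash scheme above can be sketched as follows. This is a minimal illustration, not LLD's actual code; the names (`NumShards`, `shardedMerge`) and the use of `std::unordered_set` are assumptions for the example.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <thread>
#include <unordered_set>
#include <vector>

static const size_t NumShards = 16; // N in the description above

// Each thread I owns shard I and inserts only strings whose hash maps
// to its shard, so no locking is needed: two threads never touch the
// same hash table.
std::vector<std::unordered_set<std::string>>
shardedMerge(const std::vector<std::string> &Strings) {
  std::vector<std::unordered_set<std::string>> Shards(NumShards);
  std::vector<std::thread> Threads;
  for (size_t I = 0; I < NumShards; ++I) {
    Threads.emplace_back([&Shards, &Strings, I] {
      for (const std::string &S : Strings) {
        size_t Hash = std::hash<std::string>()(S);
        if (Hash % NumShards != I)
          continue; // Another shard owns this string; skip it.
        Shards[I].insert(S); // Duplicates within a shard merge here.
      }
    });
  }
  for (std::thread &T : Threads)
    T.join();
  return Shards;
}
```

Because shard membership depends only on the string's hash, the contents of each shard (and hence the merged result) are deterministic regardless of thread scheduling.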

There are pros and cons to this algorithm compared to the
probabilistic one.

Pros:

  • It naturally produces deterministic output.
  • Output is guaranteed to be the smallest.

Cons:

  • Slower than the probabilistic algorithm due to the work it needs to do. N threads independently visit all strings, and because the number of mergeable strings is so large, even just skipping them is fairly expensive.

    On the other hand, the probabilistic algorithm doesn't need to skip any elements.
  • Unlike the probabilistic algorithm, it degrades performance if the number of available CPU cores is smaller than N, because we now have more work to do in total than the original code.

    We could fix this if we had some way of knowing how many cores are idle.

Here are perf results. The probabilistic algorithm completed the same
task in 5.227 seconds, so this algorithm is slower than that.


   36095.759481 task-clock (msec)         #    5.539 CPUs utilized            ( +-  0.83% )
        191,033 context-switches          #    0.005 M/sec                    ( +-  0.22% )
          8,194 cpu-migrations            #    0.227 K/sec                    ( +- 12.24% )
      2,342,017 page-faults               #    0.065 M/sec                    ( +-  0.06% )
 99,758,779,851 cycles                    #    2.764 GHz                      ( +-  0.79% )
 80,526,137,412 stalled-cycles-frontend   #   80.72% frontend cycles idle     ( +-  0.95% )
<not supported> stalled-cycles-backend
 46,308,518,501 instructions              #    0.46  insns per cycle
                                          #    1.74  stalled cycles per insn  ( +-  0.12% )
  8,962,860,074 branches                  #  248.308 M/sec                    ( +-  0.17% )
    149,264,611 branch-misses             #    1.67% of all branches          ( +-  0.06% )

    6.517101649 seconds time elapsed                                          ( +-  0.42% )


   45346.098328 task-clock (msec)         #    8.002 CPUs utilized            ( +-  0.77% )
        165,487 context-switches          #    0.004 M/sec                    ( +-  0.24% )
          7,455 cpu-migrations            #    0.164 K/sec                    ( +- 11.13% )
      2,347,870 page-faults               #    0.052 M/sec                    ( +-  0.84% )
125,725,992,168 cycles                    #    2.773 GHz                      ( +-  0.76% )
 96,550,047,016 stalled-cycles-frontend   #   76.79% frontend cycles idle     ( +-  0.89% )
<not supported> stalled-cycles-backend
 79,847,589,597 instructions              #    0.64  insns per cycle
                                          #    1.21  stalled cycles per insn  ( +-  0.22% )
 13,569,202,477 branches                  #  299.236 M/sec                    ( +-  0.28% )
    200,343,507 branch-misses             #    1.48% of all branches          ( +-  0.16% )

    5.666585908 seconds time elapsed                                          ( +-  0.67% )

To conclude, I lean towards the probabilistic algorithm if we can
make its output deterministic, since it's faster in any situation
(except for pathological inputs in which our assumption that most
duplicated strings are spread across inputs doesn't hold).

Event Timeline

ruiu updated this revision to Diff 79360.Nov 27 2016, 1:12 PM
ruiu updated this revision to Diff 79361.
ruiu retitled this revision from to Merge strings using sharded hash tables..
ruiu updated this object.
ruiu added a reviewer: silvas.
ruiu added a subscriber: llvm-commits.
  • Removed unused code.
silvas edited edge metadata.Nov 27 2016, 2:58 PM

This is the way I was going to suggest making the probabilistic approach deterministic.

Unlike the probabilistic algorithm, it degrades performance if the number of available CPU core is smaller than N, because we now have more work to do in total than the original code.

Not exactly. You are assuming that there is no contention for the Offset.fetch_add(Size);; keep in mind that that atomic add can actually be as expensive as one of the hash table lookups. If the small string map is covering, say 95% of the strings, that still means that you are spending 1/20 of your time contending with other cores (these are just ballpark numbers). Using an approach like in this patch where each thread can operate truly independently (well, except for false sharing as I noted above) is generally easier to reason about and can scale better.
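The `Offset.fetch_add(Size)` contention point being discussed can be sketched in isolation. This is an illustrative fragment, not D27146's actual code; `claim` is a made-up name.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Shared cursor into the output string table. Every thread that sees a
// new (not-yet-deduplicated) string bumps this same atomic.
std::atomic<uint64_t> Offset{0};

// Claim Size bytes of output space for a newly seen string. Each call
// contends on the single cache line holding Offset, which is why this
// atomic add can cost as much as a hash table lookup under load.
uint64_t claim(uint64_t Size) {
  return Offset.fetch_add(Size);
}
```

In the sharded design, by contrast, each thread advances its own per-shard offset, so there is no shared cursor to contend on (only the false-sharing issue noted below).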

Also, the approach in this patch is an interesting contrast with the map lookups, since each map shard will generally be smaller, and so lookups in the hash table are faster. On the other hand, the cores aren't sharing the map in LLC, which the approach in D27146 does do.

Also, one thing that may be worth considering is reducing the overall number of times that we read all the strings from DRAM. Currently we do it a couple of times that I can think of:

  1. in MergeInputSection<ELFT>::splitStrings
  2. in the string deduplication
  3. when writing to the output

Ideally we could do some amount of this deduplication work (or all of it) in 1, while we still have the string in cache right after hashing it.


If two cores try to update nearby section pieces, they will experience false sharing. SectionPiece is only 16 bytes, so on x86-64-hosted LLD there are 4 of them per cache line; assuming the hashes are distributed well, it's not unlikely for them to collide.

Even worse, there is negative feedback that prevents individual cores from running ahead, so they will have a hard time "spreading out" to stop stepping on each other's toes.

  • Since they all visit the string tables in the same order, if one goes ahead, it will start to experience LLC cache misses when it goes to a new string table, which will slow it down causing the others to catch up with it.
  • When there are three threads T1, T2, T3 where T1 is ahead of T2 which is ahead of T3, the following will happen: T1 updates a section piece. This leaves a "landmine" for T2 to trip on (it needs to fetch and invalidate T1's cacheline when it tries to write). When T2 trips on it, then T3 can catch up with T2, etc. So the net result is that the cores won't easily spread out.
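The layout arithmetic behind this concern can be made concrete. `SectionPieceLike` below is a stand-in with the same 16-byte size as the SectionPiece being discussed, not LLD's actual struct.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for a 16-byte piece record: two 8-byte fields.
struct SectionPieceLike {
  uint64_t InputOff;  // offset of the piece in its input section
  uint64_t OutputOff; // offset assigned during merging
};

constexpr size_t CacheLine = 64; // typical x86-64 cache line size
constexpr size_t PiecesPerLine = CacheLine / sizeof(SectionPieceLike);

static_assert(sizeof(SectionPieceLike) == 16, "two 8-byte fields");
static_assert(PiecesPerLine == 4, "four pieces share one cache line");
```

So whenever two threads write pieces whose indices fall in the same group of four, they invalidate each other's cache line even though they never touch the same piece.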

I can think of two solutions:

  1. Have forEachPiece visit the input object files in a different order for each shard (this may hurt cache locality for the string values though)
  2. Do a locality-improving sort (even just a couple levels of partitioning would be enough; it doesn't have to be perfectly sorted) of the string pieces. It doesn't even have to be sort-like really. It just needs to somehow permute the entries in a way that avoids false sharing.

One interesting thing about this patch is that it shows that visiting all the pieces is not too expensive. This suggests that maybe doing some amount of preprocessing on the pieces in order to improve locality for the other parts could be worth it.

ruiu updated this revision to Diff 80381.Dec 5 2016, 11:21 PM
ruiu edited edge metadata.

I was about to commit D27155, but it turned out that the single-core
performance of that change is too poor to check-in. So I want to submit
this patch instead.

This doesn't scale as much as D27155, but its single-core
performance doesn't suffer as badly. Here is clang link time in seconds.
As you can see, its performance is competitive if you have at least
two cores. However, it doesn't scale beyond 8 cores. (I guess the
small improvements over 8 cores come from other parts of the linker.)

  # of cores   Before   After    Delta
           1   13.462   15.795   +17.33%
           2    9.766   10.046    +2.86%
           4    7.697    7.228    -6.09%
           8    6.888    5.672   -17.65%
          12    7.073    5.848   -17.31%
          16    7.066    5.746   -18.68%
          20    6.846    5.482   -19.92%

Linking clang-scale programs with only a single core is very painful
anyway, so I think these performance characteristics are okay.

ruiu added a comment.Dec 5 2016, 11:31 PM

LLD spawns the same number of threads as the number of hardware cores. I didn't adjust that part, but just bound processes to specific cores to run the benchmarks. When I adjusted the thread count, the performance numbers got better. Here are the corrected numbers.

Single core: 14.76 seconds
Two cores: 9.121 seconds

silvas added a comment.Dec 6 2016, 9:47 PM

When running with just a single core, can we still do a sharded visitation, but instead of doing NumShards passes through all the pieces (skipping ones not in the current shard), only do a single pass? We should be able to get identical layout in that case.

I.e., instead of

        if ((Hash % NumShards) != Idx)
          continue;
        auto P = OffsetMap[Idx].insert({{S, Hash}, Off});

instead do:

auto P = OffsetMap[Hash % NumShards].insert({{S, Hash}, Off});

Maybe that can recover some of the single-threaded performance? If not, then we definitely need comments explaining what is causing the single-thread slowdown (which is equivalent to a throughput reduction; improving latency is good, but the throughput reduction seems to be about 20%+ which might be a problem (for example, for a buildbot))
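The single-pass fallback being suggested can be sketched as below. This is an illustration of the idea, not the patch's code: `assignOffsets`, the use of `std::map`, and the offset bookkeeping are all assumptions for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

static const size_t NumShards = 16;

// Single-threaded variant: one pass over all strings, routing each one
// to its shard's map by hash, instead of NumShards passes that each
// skip the strings belonging to other shards.
std::vector<uint64_t> assignOffsets(const std::vector<std::string> &Strings) {
  std::map<std::string, uint64_t> OffsetMap[NumShards];
  std::vector<uint64_t> Offsets;
  uint64_t Off = 0;
  for (const std::string &S : Strings) {
    size_t Hash = std::hash<std::string>()(S);
    // Index the shard directly rather than testing (Hash % NumShards) != Idx.
    auto P = OffsetMap[Hash % NumShards].insert({S, Off});
    if (P.second)
      Off += S.size() + 1; // First occurrence claims space (incl. NUL).
    Offsets.push_back(P.first->second); // Duplicates reuse the first offset.
  }
  return Offsets;
}
```

Note that this simplified sketch assigns offsets in visitation order, so it illustrates the single-pass routing but not the exact per-shard layout of the threaded version.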


Do we already have a helper for this in libsupport?


Please pad this so that there isn't false sharing. DenseMap is smaller than a cacheline IIRC and so currently different threads will have false sharing.

ruiu updated this revision to Diff 80964.Dec 9 2016, 3:44 PM
  • updated as per review comments.
ruiu added a comment.Dec 9 2016, 3:44 PM

Resurrected the original code that uses a (non-concurrent) string table, and use it when -no-thread is given, so that this patch doesn't hurt when threads are explicitly disabled.




Done. It actually reduced the latency of this function by almost 10%. Wow.

silvas accepted this revision.Dec 9 2016, 6:03 PM
silvas edited edge metadata.

LGTM with nits.

Although I think that this type of optimization makes sense to do, really debug fission is the best approach for reducing this cost like Rafael said in the other thread. Make sure to check with Rafael that he doesn't feel strongly that we should avoid doing this right now.

My reasoning is that one of the main selling points for LLD is that it is fast. It should be fast and there should be no fine print (e.g. having to build LLD in an uncommon way like PGO or LTO, having to change your workflow to use fission, etc.). It's fine for us to have additional advice for users about how to make their links faster, but people are probably not going to be following that advice the first time they try LLD, and that's the time when they decide whether to keep using LLD or not.


I think you can use alignas to simplify this. (r287783 seems to be surviving in tree using alignas, so I think we can use it).

Also, I think something like alignas is actually needed to get the desired effect in all cases. Currently, OffsetMapTy probably has an actual alignment requirement of sizeof(void*), so an arrangement like:

First part of DenseMap1
------- cacheline boundary -------
Second part of DenseMap1
First part of DenseMap2
------- cacheline boundary -------
Second part of DenseMap2

is still possible, which will still have false sharing. Doubling the padding to 2x the cacheline size avoids this, but alignas is probably simpler.
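The `alignas` suggestion can be sketched as follows. `MapLike` and `ShardTy` are stand-ins here (DenseMap's real fields differ); the point is only the alignment behavior.

```cpp
#include <cassert>
#include <cstddef>

constexpr size_t CacheLine = 64;

// Stand-in for a per-shard map whose natural size and alignment are
// smaller than a cache line (like a default-constructed DenseMap).
struct MapLike {
  void *Buckets = nullptr;
  unsigned NumEntries = 0;
  unsigned NumBuckets = 0;
};

// alignas raises the alignment to a full cache line, and sizeof is
// always a multiple of the alignment, so consecutive elements of a
// ShardTy array start on distinct cache lines: no false sharing.
struct alignas(CacheLine) ShardTy {
  MapLike Map;
};

static_assert(alignof(ShardTy) == CacheLine, "each shard starts a line");
static_assert(sizeof(ShardTy) % CacheLine == 0, "shards never share a line");
```

With plain trailing padding and only pointer alignment, the "straddling" layout shown above remains possible; `alignas` rules it out by construction.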


The environment variable is unused (and if you want to use it, the same comment as before about more precise naming applies).

This revision is now accepted and ready to land.Dec 9 2016, 6:03 PM
silvas closed this revision.Mar 25 2020, 6:33 PM

we landed a different version of this IIRC