This is an archive of the discontinued LLVM Phabricator instance.

Merge strings using concurrent hash map (3rd try!)
ClosedPublic

Authored by ruiu on Nov 27 2016, 4:32 PM.

Details

Summary

Here is yet another implementation of the string merging algorithm,
and this one is faster than the previous two (https://reviews.llvm.org/D27146,
https://reviews.llvm.org/D27152).

ParallelStringTableBuilder, implemented in this patch, is a concurrent
hash table specialized for string table creation. It doesn't support
resizing, and you cannot do anything with it other than insert strings
and write them out to a buffer. By limiting the use case this way, a
concurrent hash table can be implemented fairly easily. (In general it
is extremely hard.)
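
As a rough illustration of the insert path of such a table, here is a
minimal sketch; it is not the patch's code, and every name in it (Entry,
FixedConcurrentTable) is hypothetical. Each thread publishes a fully
constructed entry with a single compare-and-swap, and collisions fall
through to the next bucket (linear probing):

```cpp
#include <atomic>
#include <cstdint>
#include <memory>
#include <string_view>
#include <vector>

// Each bucket is an atomic pointer to an immutable Entry. Publishing the
// whole Entry through one CAS avoids races on partially written fields.
struct Entry {
  std::string_view Str;
  uint32_t Hash;
};

class FixedConcurrentTable {
public:
  explicit FixedConcurrentTable(size_t NumBuckets) : Buckets(NumBuckets) {}

  // Returns true if S was newly inserted, false if it was already present.
  // The table never resizes; the caller must size it generously up front.
  bool insert(std::string_view S, uint32_t Hash) {
    auto New = std::make_unique<Entry>(Entry{S, Hash});
    size_t Idx = Hash % Buckets.size();
    for (;;) {
      Entry *Existing = Buckets[Idx].load(std::memory_order_acquire);
      if (!Existing &&
          Buckets[Idx].compare_exchange_strong(Existing, New.get(),
                                               std::memory_order_release,
                                               std::memory_order_acquire)) {
        New.release(); // ownership transferred to the table
        return true;
      }
      if (Existing && Existing->Hash == Hash && Existing->Str == S)
        return false; // another thread already inserted this string
      Idx = (Idx + 1) % Buckets.size(); // linear probing: try the next bucket
    }
  }

private:
  std::vector<std::atomic<Entry *>> Buckets; // fixed size, never rehashed
};
```

Publishing through a single pointer CAS also sidesteps the
partially-written-fields race that comes up later in this review.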

This algorithm creates optimized string table in terms of size, and
the output is deterministic.

The internal hash table is an open-addressing hash table, so a conflict
is resolved by moving to the next empty bucket. That brings in
nondeterminism: if two threads try to claim the same bucket, only one
succeeds, and the other gets the next empty one, so the bucket order
depends on thread timing.

To fix the problem, we sort buckets after inserting all keys. We don't
need to sort the entire hash table as one unit; instead, we sort the
buckets within each streak (consecutive run) of claimed buckets.
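
A minimal sketch of that per-streak sort, reusing the hypothetical
layout from the table above (after insertion finishes, the buckets can
be read single-threaded as plain pointers; empty buckets are null):

```cpp
#include <algorithm>
#include <vector>

// Sort each maximal run ("streak") of occupied buckets independently.
// Linear probing guarantees that the set of buckets in each streak is the
// same for every insertion order, so this yields a deterministic layout.
void sortStreaks(std::vector<Entry *> &Buckets) {
  auto Less = [](const Entry *A, const Entry *B) { return A->Str < B->Str; };
  size_t I = 0;
  while (I < Buckets.size()) {
    if (!Buckets[I]) {
      ++I; // skip empty buckets
      continue;
    }
    size_t Begin = I;
    while (I < Buckets.size() && Buckets[I])
      ++I; // advance to the end of this streak
    std::sort(Buckets.begin() + Begin, Buckets.begin() + I, Less);
  }
}
```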

Here are the performance numbers. This is better than the probabilistic
algorithm (5.227 seconds) and the sharded hash table algorithm (5.666
seconds).

Before:

   36427.671361 task-clock (msec)         #    5.477 CPUs utilized            ( +-  1.34% )
        158,095 context-switches          #    0.004 M/sec                    ( +-  0.27% )
          6,165 cpu-migrations            #    0.169 K/sec                    ( +- 21.57% )
      2,365,415 page-faults               #    0.065 M/sec                    ( +-  0.18% )
100,831,590,020 cycles                    #    2.768 GHz                      ( +-  1.32% )
 81,880,778,356 stalled-cycles-frontend   #   81.21% frontend cycles idle     ( +-  1.55% )
<not supported> stalled-cycles-backend
 45,993,420,294 instructions              #    0.46  insns per cycle
                                          #    1.78  stalled cycles per insn  ( +-  0.17% )
  8,913,176,489 branches                  #  244.681 M/sec                    ( +-  0.28% )
    148,952,459 branch-misses             #    1.67% of all branches          ( +-  0.10% )

    6.651371241 seconds time elapsed                                          ( +-  0.80% )

After:

   46385.337835 task-clock (msec)         #    8.869 CPUs utilized            ( +-  1.14% )
        170,016 context-switches          #    0.004 M/sec                    ( +-  0.39% )
          7,903 cpu-migrations            #    0.170 K/sec                    ( +- 19.36% )
      2,302,650 page-faults               #    0.050 M/sec                    ( +-  0.08% )
128,744,691,817 cycles                    #    2.776 GHz                      ( +-  1.13% )
109,140,318,510 stalled-cycles-frontend   #   84.77% frontend cycles idle     ( +-  1.23% )
<not supported> stalled-cycles-backend
 46,600,275,432 instructions              #    0.36  insns per cycle
                                          #    2.34  stalled cycles per insn  ( +-  0.65% )
  8,953,846,757 branches                  #  193.032 M/sec                    ( +-  1.04% )
    150,976,047 branch-misses             #    1.69% of all branches          ( +-  0.19% )

    5.230174248 seconds time elapsed                                          ( +-  0.69% )

Event Timeline

ruiu updated this revision to Diff 79364.Nov 27 2016, 4:32 PM
ruiu retitled this revision from to Merge strings using concurrent hash map (3rd try!).
ruiu updated this object.
ruiu added a reviewer: silvas.
ruiu added a subscriber: llvm-commits.
ruiu updated this revision to Diff 79367.Nov 27 2016, 8:36 PM

After a nonstop whole day of hacking, I somehow managed to get deterministic
output from the concurrent hash table.

The new number is 5.230 seconds. This is slightly slower than the previous
nondeterministic implementation (5.136 seconds), but that is not bad at all.
That is about the same as the probabilistic algorithm (5.227 seconds) and
faster than the sharded hash table algorithm (5.666 seconds).

Taking into account the fact that this produces deterministic, minimal
output, this is definitely the best algorithm so far.

The downside is that the implementation is now a bit tricky. I don't
think it is hard to read, but it's undeniably more complicated than the
original single-threaded implementation. I believe that's acceptable in
this case though.

ruiu updated this object.Nov 27 2016, 8:39 PM
silvas edited edge metadata.Nov 27 2016, 9:28 PM

Wow, great work. I think I've convinced myself that this will be deterministic. The crucial observation is that with linear probing, the "streaks" (runs of consecutive occupied buckets) are always the same regardless of insertion order.

How much performance do we gain by sorting the streaks individually as this patch does? We could use a single call to std::remove_if to coalesce all the non-empty buckets and a single call to std::sort to get a deterministic order. If that doesn't cost too much performance, it would be the simpler approach.
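
A sketch of that simpler alternative, under the same hypothetical bucket
layout as the sketches above (a fragment, not the patch's code):

```cpp
// Coalesce all occupied buckets to the front, then sort them in one call.
auto End = std::remove_if(Buckets.begin(), Buckets.end(),
                          [](const Entry *E) { return E == nullptr; });
std::sort(Buckets.begin(), End,
          [](const Entry *A, const Entry *B) { return A->Str < B->Str; });
```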

ELF/OutputSections.cpp
475

This is extremely troubling as it implies that we need to *guarantee* that we have chosen the size large enough or else LLD will have undefined behavior.

494

This property depends critically on the linear probing. This would not hold if we used quadratic probing. It would be good to mention that.

533

Won't this end up smaller for 32-bit hosts?

579

This must be unreachable, or it must somehow signal this so that the calling code can retry the string table building with a bigger size. We can't have LLD fail to link because a user's object files don't have enough duplicate strings.

ruiu added a comment.Nov 27 2016, 9:35 PM

I will address review comments tomorrow, but here is a breakdown. When merging 29,313,742 strings, we spent

  • 185,729 us to insert them into the concurrent hash table,
  • 70,814 us to sort runs of claimed buckets,
  • 66,560 us to assign string table offsets to buckets, and
  • 124,903 us to update the OutputOff member of all SectionPieces.

I think the algorithm is correct, but this patch is not ready for commit yet. As you said, we need to handle the case where the hash table becomes full. (My rough idea: when it becomes full, discard the entire hash table and redo from scratch with a larger one. Resizing an in-use concurrent hash table is extremely hard.)

ruiu updated this revision to Diff 79414.Nov 28 2016, 9:16 AM
ruiu updated this object.
ruiu edited edge metadata.
  • Move the class to Concurrent.{h,cpp}
  • Handle the case when the table becomes full
ruiu added a comment.Nov 28 2016, 9:19 AM

As to whether we should call std::remove_if and then std::sort only once, I don't think we should. Sorting is O(n log n), so you don't want to make n larger by merging streaks that can be sorted independently.
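
For a rough sense of the gap, with illustrative numbers: sorting
n = 1,048,576 occupied buckets in one global call does about
n * log2(n) = 20n comparisons, while sorting them as streaks of average
length 16 does about n * log2(16) = 4n, five times fewer (assuming the
streaks stay short, which a low load factor encourages).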

I don't think that it makes sense to put this in a file with a generic name like "Concurrent", as this is a quite specialized data structure that depends on specific LLD types. Maybe just call it ConcurrentStringTableBuilder.h?
I'm a bit worried about the layering too; does this class introduce any circular dependencies?

ELF/Concurrent.cpp
24 ↗(On Diff #79414)

Why double it?

Also, you have a similar std::max calculation in MergeOutputSection<ELFT>::finalize. Do you need it in both places?

26 ↗(On Diff #79414)

Can you use a std::unique_ptr<EntryTy[]> to manage this?

73 ↗(On Diff #79414)

I think this technically needs to be atomic to avoid races.

ELF/OutputSections.cpp
772

I'm very concerned that there may be user scenarios where we almost always need multiple trips through this loop (e.g., maybe "most" non-debug builds end up with only 2 duplicates on average). How will we determine whether that is the case? Otherwise this "optimization" may backfire and users in the wild will get slower links; we need some feedback loop to correct for that, or to be really confident that this optimization always speeds things up.

I would propose that we do a run of Poudriere or Debian or Gentoo with this patch applied and treat the resize case as a fatal error (and ideally the error message would also give the number of resizes). That way we can get an idea of how common this is.

Sorry for the delay.

ELF/Concurrent.cpp
53 ↗(On Diff #79414)

I assume the idea here is to identify as early as possible whether we underestimated the number of buckets. I like this idea. Do you have any justification for the choice of numbers? Why is this better than just waiting for the table to become full? How much better? Should the check be 50%? (Knuth Vol 3 Sec 6.4 provides a good analysis for linearly probed open-addressed hash tables; the lookup costs will skyrocket after 75% load; is that your motivation? It would be good to have a comment)
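
For reference, the standard linear-probing estimates (the analysis Knuth
gives) put the expected number of probes at load factor a at roughly

  C_hit  ~ (1/2) * (1 + 1/(1 - a))
  C_miss ~ (1/2) * (1 + 1/(1 - a)^2)

so an unsuccessful probe sequence averages about 2.5 probes at a = 0.5,
8.5 at a = 0.75, and 50.5 at a = 0.9, which is the cliff past 75% load
that would motivate such a check.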

It may be better to instead track the duplication factor (or an approximation thereof). That way, even if we end up full this time, the caller can ask for the estimated duplication factor and get it right next time with high probability. This will probably be enough to avoid any worry about an excessive number of retries (with the current 10x duplication factor assumption, we might need up to 4 retries with doubling like you have it now; with an estimate, we could make it max of 2 with high probability).
Another possibility is

Also, I would like to see a comment justifying why it is deterministic whether or not IsTableFull is set to true. Note that if it is nondeterministic then LLD's output will be nondeterministic, so it is important to get this right.

ELF/Concurrent.h
34 ↗(On Diff #79414)

Using how many cores?

57 ↗(On Diff #79414)

Please mention that this is crucially dependent on the linear probing strategy. Also, I would add a scary comment on the loop that does the probing saying that it *must* be kept as linear probing.

ruiu added inline comments.Nov 30 2016, 12:55 PM
ELF/Concurrent.cpp
24 ↗(On Diff #79414)

Because my target load factor is 0.5. I chose this number because

  • I don't want to exceed 0.75, at which point an open-addressing hash
    table gets much slower, and
  • even if the table can contain all given strings, a long streak of
    occupied buckets would make the hash table slower because of std::sort.

In order to tune this formula, I'd need to run it against various programs,
but that's not that important at the moment. I'll do that later.

26 ↗(On Diff #79414)

Changed to use std::vector.

53 ↗(On Diff #79414)

I don't have precise reasoning for these; I can only give my gut feeling. There should be room to optimize this, but we can do that later.

  • Why is it 0.75 and not 0.5?

I think a load factor of 0.5 is too early to give up. When we give up, we need to create a new table and restart, so we should continue using the same table until it approaches a much higher load. 0.75 seems like a good cutoff because beyond that the table gets really slow.

  • Shouldn't we keep track of the duplication factor?

We would if we could. But currently we stop adding items to the hash table once it gets "full", so after that point we don't know whether the remaining strings are duplicates or not.

I added a comment about the load factor and simplified the code for isFull(). I hope this makes things clearer.

73 ↗(On Diff #79414)

Yeah, I was tempted to say that this is a benign race, but according to the experts there is no such thing as a "benign race". Fixed.
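
The usual shape of such a fix, sketched here with the IsTableFull flag
mentioned earlier in the review (a general illustration, not the patch's
actual code):

```cpp
#include <atomic>

// A plain `bool IsTableFull` written concurrently by multiple threads is a
// data race (undefined behavior), even if every writer stores `true`.
// Making the flag atomic gives the same logic well-defined semantics.
std::atomic<bool> IsTableFull{false};

void markFull() { IsTableFull.store(true, std::memory_order_relaxed); }
bool isFull() { return IsTableFull.load(std::memory_order_relaxed); }
```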

ELF/OutputSections.cpp
772

I added code here to get a better estimate. We now keep track of an approximate number N of successfully inserted items. If N < NumPieces, only a fraction N / NumPieces of the pieces were inserted, so we need to enlarge the table by the inverse, namely NumPieces / N.
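
A sketch of what that retry logic can look like, with hypothetical names
(insertAllPieces, isFull, Pieces, NumPieces, EstimatedNumStrings) standing
in for the patch's actual interfaces:

```cpp
// Grow the table by the observed under-sizing factor and redo from scratch;
// resizing a live concurrent table is avoided entirely.
size_t NumBuckets = EstimatedNumStrings * 2; // initial guess, load factor 0.5
for (;;) {
  FixedConcurrentTable Table(NumBuckets);
  size_t N = insertAllPieces(Table, Pieces); // approx. successful inserts
  if (!Table.isFull())
    break; // everything fit; proceed to sorting and offset assignment
  // Only about N of NumPieces pieces fit, so the table was undersized by
  // roughly NumPieces / N; grow by that factor (and at least 2x) and retry.
  NumBuckets = std::max(NumBuckets * 2,
                        NumBuckets * NumPieces / std::max<size_t>(N, 1));
}
```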

ruiu updated this revision to Diff 79799.Nov 30 2016, 12:55 PM
  • Address review comments.
silvas added inline comments.Nov 30 2016, 9:37 PM
ELF/OutputSections.cpp
763

Where is the accompanying test?
Also, please make the name more specific.

ruiu updated this revision to Diff 80209.Dec 4 2016, 9:58 AM
ruiu marked 8 inline comments as done.
  • Address review comments.
silvas accepted this revision.Dec 4 2016, 2:24 PM
silvas edited edge metadata.

LGTM. Thanks for working on this!

This revision is now accepted and ready to land.Dec 4 2016, 2:24 PM
ruiu added a comment.Dec 5 2016, 10:59 PM

I'm struggling to improve the single-core performance of this patch. It scales well, but its single-core performance is poor. The table below shows the time to link clang with debug info (in seconds). As you can see, you need at least 4 cores to benefit from this patch.

  # of cores   Before   After    Delta
           1   13.462   17.048   +21.03%
           2    9.766   10.902   +10.42%
           4    7.697    6.935   -10.98%
           8    6.888    5.674   -21.39%
          12    7.073    5.812   -21.69%
          16    7.066    5.569   -26.88%
          20    6.846    5.226   -30.99%

I tried to optimize it, but because it fundamentally does more work than the simple hash table approach, it is almost impossible to compete with the original algorithm (that said, I think this is still too slow).

We cannot make the linker use this algorithm only when it detects 4 or more cores, because the choice of algorithm affects the layout of mergeable output sections, and we want deterministic output for the same input regardless of how many processors a computer has.

I've started thinking that the second, sharded algorithm may be better than this one because, even though it doesn't scale as well as this algorithm, its single-core performance is not that bad. I'll update the patch with performance numbers.

I'm sorry for the back-and-forth.

silvas closed this revision.Mar 25 2020, 6:33 PM

We landed a different version of this, IIRC. Closing diff.