Here is yet another different implementation of string merging algorithm.

And this is faster than the previous two (https://reviews.llvm.org/D27146,

https://reviews.llvm.org/D27152).

ParallelStringTableBuilder implemented in this patch is a concurrent

hash table specialized for string table creation. It doesn't support

resizing, and you cannot do anything other than inserting strings into

the builder and write the string down to a buffer. By limiting use case,

a concurrent hash table can be implemented fairly easily. (Generally it

is extremely hard.)

This algorithm creates optimized string table in terms of size, and

the output is deterministic.

The internal hash table is an open-addressing hash table, so conflicts

are resolved by using next empty buckets. That brings in nondeterminism.

If two threads tries to claim the same bucket, only one succeeds, and

the other gets next empty one. So the bucket order is not deterministic.

To fix the problem, we sort buckets after inserting all keys.

We don't need to sort the entire hash table as one unit. Instead,

we sort buckets for each streak of claimed buckets.

Here is the performance number. This is better than the probabilistic

algorithm (5.227 seconds) and the sharded hash table algorithm (5.666

seconds).

Before: 36427.671361 task-clock (msec) # 5.477 CPUs utilized ( +- 1.34% ) 158,095 context-switches # 0.004 M/sec ( +- 0.27% ) 6,165 cpu-migrations # 0.169 K/sec ( +- 21.57% ) 2,365,415 page-faults # 0.065 M/sec ( +- 0.18% ) 100,831,590,020 cycles # 2.768 GHz ( +- 1.32% ) 81,880,778,356 stalled-cycles-frontend # 81.21% frontend cycles idle ( +- 1.55% ) <not supported> stalled-cycles-backend 45,993,420,294 instructions # 0.46 insns per cycle # 1.78 stalled cycles per insn ( +- 0.17% ) 8,913,176,489 branches # 244.681 M/sec ( +- 0.28% ) 148,952,459 branch-misses # 1.67% of all branches ( +- 0.10% ) 6.651371241 seconds time elapsed ( +- 0.80% ) After: 46385.337835 task-clock (msec) # 8.869 CPUs utilized ( +- 1.14% ) 170,016 context-switches # 0.004 M/sec ( +- 0.39% ) 7,903 cpu-migrations # 0.170 K/sec ( +- 19.36% ) 2,302,650 page-faults # 0.050 M/sec ( +- 0.08% ) 128,744,691,817 cycles # 2.776 GHz ( +- 1.13% ) 109,140,318,510 stalled-cycles-frontend # 84.77% frontend cycles idle ( +- 1.23% ) <not supported> stalled-cycles-backend 46,600,275,432 instructions # 0.36 insns per cycle # 2.34 stalled cycles per insn ( +- 0.65% ) 8,953,846,757 branches # 193.032 M/sec ( +- 1.04% ) 150,976,047 branch-misses # 1.69% of all branches ( +- 0.19% ) 5.230174248 seconds time elapsed ( +- 0.69% )