This is another attempt to speed up string merging. You want to read

the description of https://reviews.llvm.org/D27146 first.

In this patch, I took a different approach than the probabilistic

algorithm used in D27146. Here is the algorithm.

The original code has a single hash table to merge strings. Now we

have N hash tables where N is a parallelism (currently N=16).

We invoke N threads. Each thread knows its thread index I where

0 <= I < N. For each string S in a given string set, thread I adds S

to its own hash table only if hash(S) % N == I.

When all threads are done, there are N string tables with all

duplicated strings being merged.

There are pros and cons of this algorithm compared to the

probabilistic one.

Pros:

- It naturally produces deterministic output.
- Output is guaranteed to be the smallest.

Cons:

- Slower than the probabilistic algorithm due to the work it needs to do. N threads independently visit all strings, and because the number of mergeable strings it too large, even just skipping them is fairly expensive.

On the other hand, the probabilistic algorithm doesn't need to skip any element. - Unlike the probabilistic algorithm, it degrades performance if the number of available CPU core is smaller than N, because we now have more work to do in total than the original code.

We can fix this if we are able to know in some way about how many cores are idle.

Here are perf results. The probabilistic algorithm completed the same

task in 5.227 seconds, so this algorithm is slower than that.

Before: 36095.759481 task-clock (msec) # 5.539 CPUs utilized ( +- 0.83% ) 191,033 context-switches # 0.005 M/sec ( +- 0.22% ) 8,194 cpu-migrations # 0.227 K/sec ( +- 12.24% ) 2,342,017 page-faults # 0.065 M/sec ( +- 0.06% ) 99,758,779,851 cycles # 2.764 GHz ( +- 0.79% ) 80,526,137,412 stalled-cycles-frontend # 80.72% frontend cycles idle ( +- 0.95% ) <not supported> stalled-cycles-backend 46,308,518,501 instructions # 0.46 insns per cycle # 1.74 stalled cycles per insn ( +- 0.12% ) 8,962,860,074 branches # 248.308 M/sec ( +- 0.17% ) 149,264,611 branch-misses # 1.67% of all branches ( +- 0.06% ) 6.517101649 seconds time elapsed ( +- 0.42% ) After: 45346.098328 task-clock (msec) # 8.002 CPUs utilized ( +- 0.77% ) 165,487 context-switches # 0.004 M/sec ( +- 0.24% ) 7,455 cpu-migrations # 0.164 K/sec ( +- 11.13% ) 2,347,870 page-faults # 0.052 M/sec ( +- 0.84% ) 125,725,992,168 cycles # 2.773 GHz ( +- 0.76% ) 96,550,047,016 stalled-cycles-frontend # 76.79% frontend cycles idle ( +- 0.89% ) <not supported> stalled-cycles-backend 79,847,589,597 instructions # 0.64 insns per cycle # 1.21 stalled cycles per insn ( +- 0.22% ) 13,569,202,477 branches # 299.236 M/sec ( +- 0.28% ) 200,343,507 branch-misses # 1.48% of all branches ( +- 0.16% ) 5.666585908 seconds time elapsed ( +- 0.67% )

To conclude, I lean towards the probabilistic algorithm if we can

make its output deterministic, since its faster in any sitaution

(except for pathetic inputs in which our assumption that most

duplicated strings are spread across inputs doesn't hold.)