This patch parallelizes MergeTailSection in the same way I parallelized
MergeNoTailSection in r314588.
On my 2-socket, 20-core, 40-thread Xeon E5-2680 @ 2.8 GHz machine,
this patch shortens the link time of a clang debug build with -O2 from
11.35s to 5.72s. Without -O2, it's 5.23s, so the overhead of -O2 is
now about half a second in this test environment.

It seems you could store ShardID : 5 here instead of TailHash.
That would free up a few bits for OutputOff and let you drop some computations from the code.
Does that make sense?
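
A minimal sketch of the suggested layout, to make the bit budget concrete. Only ShardID : 5 and OutputOff come from the comment above; the struct name Piece, the Live bit, the helpers shardOf, makePiece and setOutputOff, and the choice of taking the shard from the top bits of a 32-bit hash are all assumptions made for illustration, not the actual patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical constants: 5 bits of shard ID give 32 shards.
constexpr unsigned ShardBits = 5;
constexpr unsigned NumShards = 1u << ShardBits;
constexpr unsigned OffsetBits = 64 - ShardBits - 1;

// Hypothetical piece descriptor packed into a single 64-bit word.
struct Piece {
  uint64_t ShardID : ShardBits;    // computed once, when the piece is created
  uint64_t Live : 1;               // assumed liveness flag
  uint64_t OutputOff : OffsetBits; // remaining 58 bits for the output offset
};

static_assert(sizeof(Piece) == 8, "Piece should pack into one 64-bit word");

// Assumption: the shard is taken from the top ShardBits bits of the hash.
inline unsigned shardOf(uint32_t Hash) { return Hash >> (32 - ShardBits); }

// Build a piece, deriving the shard ID from the hash exactly once.
inline Piece makePiece(uint32_t Hash, bool Live) {
  Piece P{};
  P.ShardID = shardOf(Hash);
  P.Live = Live;
  P.OutputOff = 0;
  return P;
}

// Store an output offset, checking that it fits in the available bits.
inline void setOutputOff(Piece &P, uint64_t Off) {
  assert(Off < (uint64_t(1) << OffsetBits) && "offset does not fit");
  P.OutputOff = Off;
}
```

With this layout, later passes would read P.ShardID directly rather than deriving the shard from a stored hash each time, which is one way to read the "get rid of computations" part of the suggestion.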