When linking a Debug build of clang (265MiB of SHF_ALLOC sections, 920MiB of
uncompressed debug info), "Compress debug sections" takes about 2/3 of the link
time with --threads=1 and ~70% of the link time with --threads=8.
This patch splits a section into 2MiB shards and calls zlib `deflate` on the
shards in parallel.
* use `Z_SYNC_FLUSH` for all shards but the last to flush the output to a byte boundary, so that it can be concatenated with the next shard.
* use `Z_FINISH` for the last shard to set the BFINAL flag, which indicates the end of the output stream (per RFC 1951).
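The two flush modes above can be sketched with Python's standard `zlib` module, which exposes the same flush constants. This is a minimal illustration, not the code in this patch: `deflate_shard` and `compress_sharded` are hypothetical names, a thread pool stands in for the linker's parallel loop, and wrapping the concatenated raw DEFLATE data in a zlib header plus Adler-32 trailer is an assumption of the sketch.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

SHARD_SIZE = 2 * 1024 * 1024  # 2MiB, the shard size quoted in this description


def deflate_shard(shard: bytes, is_last: bool) -> bytes:
    # wbits=-15 selects a raw DEFLATE stream (no zlib header or trailer).
    # Every shard gets a fresh compressor, so no hash table is shared.
    comp = zlib.compressobj(level=6, wbits=-15)
    # Z_SYNC_FLUSH ends the output on a byte boundary; Z_FINISH sets BFINAL.
    mode = zlib.Z_FINISH if is_last else zlib.Z_SYNC_FLUSH
    return comp.compress(shard) + comp.flush(mode)


def compress_sharded(data: bytes, shard_size: int = SHARD_SIZE,
                     threads: int = 8) -> bytes:
    # Split the input into fixed-size shards (one empty shard for empty input).
    shards = [data[i:i + shard_size]
              for i in range(0, len(data), shard_size)] or [b""]
    last = len(shards) - 1
    with ThreadPoolExecutor(max_workers=threads) as pool:
        bodies = list(pool.map(
            lambda item: deflate_shard(item[1], item[0] == last),
            enumerate(shards)))
    # Assumed framing: a standard zlib header and the Adler-32 checksum of
    # the whole uncompressed input, stored big-endian.
    header = b"\x78\x9c"
    trailer = zlib.adler32(data).to_bytes(4, "big")
    return header + b"".join(bodies) + trailer
```

An ordinary decompressor should accept the concatenated result, so `zlib.decompress(compress_sharded(data)) == data` is a quick way to check the framing.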
In a --threads=8 link, "Compress debug sections" is 5.7x as fast and the whole
link is 2.54x as fast. Because the hash table for one shard is not shared with
the next shard, the output is larger, but only slightly.
```
% bloaty clang.new -- clang.old
    FILE SIZE        VM SIZE
 --------------  --------------
  +0.2% +74.2Ki  [ = ]       0    .debug_line
  +0.1% +72.3Ki  [ = ]       0    .debug_str
  +0.0% +69.9Ki  [ = ]       0    .debug_info
  +0.1%    +976  [ = ]       0    .debug_abbrev
  +0.0%    +882  [ = ]       0    .debug_ranges
  +0.0%  +218Ki  [ = ]       0    TOTAL
```
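As a rough consistency check on the speedup figures quoted above, Amdahl's law relates the share of link time spent compressing to the overall speedup. The sketch below is illustrative only; `overall_speedup` and `implied_fraction` are hypothetical helper names, and the inputs are the rounded numbers from this description.

```python
def overall_speedup(fraction: float, stage_speedup: float) -> float:
    """Amdahl's law: only `fraction` of the run time is accelerated."""
    return 1.0 / ((1.0 - fraction) + fraction / stage_speedup)


def implied_fraction(total_speedup: float, stage_speedup: float) -> float:
    """Invert Amdahl's law to recover the accelerated fraction."""
    return (1.0 - 1.0 / total_speedup) / (1.0 - 1.0 / stage_speedup)


# A stage taking exactly 70% of the time, made 5.7x as fast.
print(round(overall_speedup(0.70, 5.7), 2))   # 2.37
# The share of time that would give exactly a 2.54x overall speedup.
print(round(implied_fraction(2.54, 5.7), 3))  # 0.735
```

A share of roughly 73.5% fits the "~70%" figure, so the 5.7x and 2.54x numbers are mutually consistent.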
The 2MiB shard size, the initial output buffer size of 0.5x the input size, and
the 1.5x growth rate are pretty arbitrary and work for some programs I have
tested. I'd like to hear what parameters you think are appropriate.
Bonuses of not using zlib::compress:
* we can compress a debug section larger than 4GiB
* peak memory usage is lower because for most shards the output size is
  less than 50% of the input size (all less than 55% for a large binary I
  tested, but decreasing the initial output size does not decrease
  memory usage)