This is an archive of the discontinued LLVM Phabricator instance.


Differential D117853

[ELF] Parallelize --compress-debug-sections=zlib
ClosedPublic

Authored by MaskRay on Jan 20 2022, 9:50 PM.


Details

Reviewers

alexander-shaposhnikov
bd1976llvm
ikudrin
peter.smith
mgorny

Commits

rG4cdc4416903b: [ELF] Parallelize --compress-debug-sections=zlib

Summary

When linking a Debug build of clang (265MiB SHF_ALLOC sections, 920MiB uncompressed
debug info), "Compress debug sections" takes about 2/3 of the link time in a
--threads=1 link and ~70% of the link time in a --threads=8 link.

This patch splits a section into 1MiB shards and calls zlib deflate on the shards in parallel.

- Use Z_SYNC_FLUSH for all shards but the last, to flush the output to a byte boundary so that it can be concatenated with the next shard.
- Use Z_FINISH for the last shard, to set the BFINAL flag that indicates the end of the output stream (per RFC 1951).
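The sharding rule above can be sketched as follows. This is a minimal illustration, not the code from the patch: `Shard`, `Flush` and `splitIntoShards` are hypothetical names, and the `Flush` enum only stands in for zlib's `Z_SYNC_FLUSH` and `Z_FINISH` so that the sketch does not depend on zlib. Treating an empty section as one empty shard is also an assumption made here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for zlib's Z_SYNC_FLUSH and Z_FINISH.
enum class Flush { Sync, Finish };

// One unit of work for a deflate call (names are illustrative only).
struct Shard {
  size_t offset; // start of the shard within the section contents
  size_t size;   // number of input bytes in the shard
  Flush flush;   // flush mode to use when compressing this shard
};

// Split `sectionSize` bytes into shards of at most `shardSize` bytes (1MiB by
// default). Every shard but the last is flushed to a byte boundary; only the
// last one terminates the stream.
std::vector<Shard> splitIntoShards(size_t sectionSize,
                                   size_t shardSize = size_t(1) << 20) {
  assert(shardSize != 0 && "shard size must be positive");
  // Assumption: an empty section still yields one shard so that a final block
  // is emitted.
  size_t numShards =
      sectionSize == 0 ? 1 : (sectionSize + shardSize - 1) / shardSize;
  std::vector<Shard> shards;
  shards.reserve(numShards);
  for (size_t i = 0; i != numShards; ++i) {
    size_t offset = i * shardSize;
    size_t size = std::min(shardSize, sectionSize - offset);
    Flush flush = (i + 1 == numShards) ? Flush::Finish : Flush::Sync;
    shards.push_back({offset, size, flush});
  }
  return shards;
}
```

Each resulting shard can then be handed to a separate task; because the shards are independent, no state has to be shared between the deflate calls.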

In a --threads=8 link, "Compress debug sections" is 5.7x as fast and the whole
link is 2.54x as fast. Because the hash table for one shard is not shared with
the next shard, the output is slightly larger. A better compression ratio can be
achieved by preloading the last window-size bytes of the previous shard as a
dictionary (deflateSetDictionary), but that is overkill.

# 1MiB shards
% bloaty clang.new -- clang.old
    FILE SIZE        VM SIZE
 --------------  --------------
  +0.3%  +129Ki  [ = ]       0    .debug_str
  +0.1%  +105Ki  [ = ]       0    .debug_info
  +0.3%  +101Ki  [ = ]       0    .debug_line
  +0.2% +2.66Ki  [ = ]       0    .debug_abbrev
  +0.0% +1.19Ki  [ = ]       0    .debug_ranges
  +0.1%  +341Ki  [ = ]       0    TOTAL

# 2MiB shards
% bloaty clang.new -- clang.old
    FILE SIZE        VM SIZE
 --------------  --------------
  +0.2% +74.2Ki  [ = ]       0    .debug_line
  +0.1% +72.3Ki  [ = ]       0    .debug_str
  +0.0% +69.9Ki  [ = ]       0    .debug_info
  +0.1%    +976  [ = ]       0    .debug_abbrev
  +0.0%    +882  [ = ]       0    .debug_ranges
  +0.0%  +218Ki  [ = ]       0    TOTAL

Bonuses of not using zlib::compress:

- We can compress a debug section larger than 4GiB.
- Peak memory usage is lower because for most shards the output size is less than 50% of the input size (all were less than 55% for a large binary I tested, but decreasing the initial output size does not decrease memory usage).

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

MaskRay created this revision. Jan 20 2022, 9:50 PM

Herald added subscribers: arichardson, mgorny, emaste. Jan 20 2022, 9:50 PM

MaskRay requested review of this revision. Jan 20 2022, 9:50 PM

Herald added a project: Restricted Project. Jan 20 2022, 9:50 PM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B144741: Diff 401859. Jan 20 2022, 10:03 PM

In a large executable I tested, for all shards, the compressed size divided by the uncompressed size is smaller than 0.558342. The median is 0.408379.
I have tried 0.25 of the input size as the initial output size but do not see a memory usage difference.

alexander-shaposhnikov added inline comments. Jan 20 2022, 11:38 PM

lld/ELF/OutputSections.cpp
291	I'm wondering if you have considered using llvm/Support/Compression.h (the implementation there appears to contain some bits to make it msan-friendly + error handling, but I'm not closely familiar with that code)

MaskRay added inline comments. Jan 20 2022, 11:42 PM

lld/ELF/OutputSections.cpp
291	The code is largely lld/ELF specific. If I add the code to llvm/Support/Compression.h, LLVMSupport will get bloated. Technically llvm-objcopy --compress-debug-sections can use the code as well but the two projects may have different tweaks and sharing code won't help much in my opinion.

alexander-shaposhnikov added inline comments. Jan 21 2022, 1:37 AM

lld/ELF/OutputSections.cpp
291	just in case - after looking at https://zlib.net/manual.html and https://llvm.org/doxygen/Compression_8cpp_source.html - the return values of `deflateInit2`, `deflate` or `compress2` are not ignored there. p.s. Compression.h contains wrappers around compress2, but what's going on here is a bit different, (compression of chunks + no headers), so, yeah, it answers my question above.

No objections from me. I think the speedup is worth the small amount of extra size. I've made a few small suggestions but they are all subjective. I don't have a lot of large programs hanging around to test this on. I guess something like Chromium would give you another data point.

If you've not done it yet, it would be good to open the test program in a debugger to check whether it can decompress the output. I'd expect there to be no problems but it could be worth a sanity check.

lld/ELF/OutputSections.cpp
301	Typo // Allocate a buffer
356	Is it worth picking a plural as there can be more than one shard? Similarly for out and adler. For example ins, outs and adlers. I'm not sure ins and outs sound right though, perhaps shardsIn and shardsOut. Again not a strong opinion.
357	Might be worth using start and end rather than i and j? I've not got a strong opinion here, happy to keep with i, j if you prefer.
365	The code above uses idx for going through in[] and i for something else, could be worth using the same name?
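As background for the per-shard `adler` values mentioned in the comments above: a zlib stream ends with a single Adler-32 checksum of the whole uncompressed input, so checksums computed per shard have to be merged. zlib provides `adler32_combine` for this. The sketch below re-derives the same arithmetic in plain C++ so that it has no zlib dependency; `adler32Of` and `adler32Concat` are hypothetical names, and nothing here is copied from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr uint32_t kAdlerMod = 65521; // largest prime below 2^16

// Plain (unoptimized) Adler-32 of one buffer, starting from the initial
// value 1. The low 16 bits hold sum A, the high 16 bits hold sum B.
uint32_t adler32Of(const uint8_t *data, size_t len) {
  uint32_t a = 1, b = 0;
  for (size_t i = 0; i != len; ++i) {
    a = (a + data[i]) % kAdlerMod;
    b = (b + a) % kAdlerMod;
  }
  return (b << 16) | a;
}

// Adler-32 of the concatenation X+Y, given adler(X), adler(Y) and len(Y):
//   A = A1 + A2 - 1              (mod 65521)
//   B = B1 + B2 + len2 * (A1 - 1) (mod 65521)
uint32_t adler32Concat(uint32_t adler1, uint32_t adler2, uint64_t len2) {
  uint64_t a1 = adler1 & 0xffff, b1 = adler1 >> 16;
  uint64_t a2 = adler2 & 0xffff, b2 = adler2 >> 16;
  uint64_t rem = len2 % kAdlerMod;
  uint64_t a1m1 = (a1 + kAdlerMod - 1) % kAdlerMod; // A1 - 1, kept non-negative
  uint64_t a = (a1 + a2 + kAdlerMod - 1) % kAdlerMod;
  uint64_t b = (b1 + b2 + rem * a1m1) % kAdlerMod;
  return uint32_t((b << 16) | a);
}
```

Starting from the initial value 1 and folding in each shard's checksum in order gives the checksum of the whole section, so the per-shard values can be computed in parallel and merged serially at the end.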

Is there any chance to avoid buffering the compressed output? (I guess probably not, because you need to know how large it is before you write it to the output file (if you want to parallelize writing sections, which is important no doubt))

In D117853#3261856, @dblaikie wrote:

Is there any chance to avoid buffering the compressed output? (I guess probably not, because you need to know how large it is before you write it to the output file (if you want to parallelize writing sections, which is important no doubt))

I have asked myself this question... Unfortunately no. To have an accurate estimate of sizes, we have to buffer all compressed output.
The sizes are needed to compute the sh_offset and sh_size fields of a .debug_* section. To know the size we need to compress the section first (or estimate, but the compression ratio is not easy to estimate).

I think pigz keeps only as many shards in memory as its concurrency level, but it does not have the requirement to know the output size beforehand.
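To make the size dependency concrete, here is a rough sketch of how the final section size and the per-shard write positions follow from the buffered shards. The layout assumed here (an ELF64 compression header, a 2-byte zlib header, the concatenated shard outputs, then a 4-byte Adler-32 trailer) comes from the ELF gABI and RFC 1950; the function names are hypothetical and this is not the code from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint64_t kChdrSize64 = 24;     // sizeof(Elf64_Chdr)
constexpr uint64_t kZlibHeaderSize = 2;  // CMF and FLG bytes (RFC 1950)
constexpr uint64_t kZlibTrailerSize = 4; // Adler-32 of the uncompressed data

using ShardBuffers = std::vector<std::vector<uint8_t>>;

// Total size of the compressed section; only known once every shard has been
// compressed into its buffer.
uint64_t compressedSectionSize(const ShardBuffers &shards) {
  uint64_t size = kChdrSize64 + kZlibHeaderSize;
  for (const auto &shard : shards)
    size += shard.size();
  return size + kZlibTrailerSize;
}

// Offset, relative to the start of the section, at which each shard's bytes
// go. With these prefix sums the shards can be copied to the output in
// parallel.
std::vector<uint64_t> shardWriteOffsets(const ShardBuffers &shards) {
  std::vector<uint64_t> offsets;
  offsets.reserve(shards.size());
  uint64_t pos = kChdrSize64 + kZlibHeaderSize;
  for (const auto &shard : shards) {
    offsets.push_back(pos);
    pos += shard.size();
  }
  return offsets;
}
```

Both results depend on every shard's compressed length, which is why the output of all shards has to exist before the file offsets of later sections can be assigned.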

address comments
update description

MaskRay added inline comments. Jan 21 2022, 1:21 PM

lld/ELF/OutputSections.cpp
345	This zero fills the buffer, but I have tested that removing it and adding gap filling in `writeTo` does not improve performance.

Harbormaster completed remote builds in B144923: Diff 402095. Jan 21 2022, 3:53 PM

Simplify
improve description

Harbormaster completed remote builds in B144986: Diff 402178. Jan 21 2022, 11:10 PM

MaskRay edited the summary of this revision. Jan 21 2022, 11:10 PM

https://maskray.me/blog/2022-01-23-compressed-debug-sections#linkers has a longer discussion why avoiding memory allocation is bad.
Note: this patch decreases memory usage because the previous deflateBound is wasteful (it's always larger than the input size).

Can we do better? At any one time, the compressed data is stored in two places: once in the allocated memory holding the compressed shard, and once in the memory-mapped output file. It would be nice if we could avoid the memory allocation. Unfortunately we need to compute the section size, otherwise we do not know the offsets of the following sections and the section header table. There is no good way to estimate the compressed section size without doing the compression. Technically, if the section header table along with .symtab/.shstrtab/.strtab is moved before the debug sections, we can compress the debug sections and append them to the output file. The output file will unfortunately be unconventional, and this will not work when a linker script specifies the exact order of sections. It is just too hacky to do so much to save a little memory.

No objections from me too.

lld/ELF/OutputSections.cpp
347	Maybe mention `Z_BEST_SPEED` instead of just `1`?

This revision is now accepted and ready to land. Jan 24 2022, 8:07 AM

In D117853#3261870, @MaskRay wrote:

I have asked myself this question... Unfortunately no. To have accurate estimate of sizes, we have to buffer all compressed output.
It's needed to compute sh_offset and sh_size fields of a .debug_* section. To know the size we need to compress it first (or estimate, but the compression ratio is not easy to estimate).

I think pigz uses an approach to only keep concurrency shards, but it does not have the requirement to know the output size beforehand.

Yeah, I guess out of scope for this change - but maybe another time. It'd break parallelism, but you could stream out a section at a time (at least for the compressed sections) and then seek back to write the sh* offset fields based on how the compression actually worked out.

I guess for Split DWARF the memory savings wouldn't be that significant, though? Do you have a sense of how much memory it'd take.

Another direction to go could be to do compressed data concatenation - if the compression algorithm supports concatenation, you could lose some size benefits and gain speed (like lld's sliding scale of string deduplication) by just concatenating the compressed sections together - predictable size and you could write the updated compressed section header based on the input sections headers.

Though I guess most of the DWARF sections remaining in the objects/linked binary when using Split DWARF require relocations to be applied, so that requires decompressing/recompressing anyway... :/

In D117853#3267965, @dblaikie wrote:

Yeah, I guess out of scope for this change - but maybe another time. It'd break parallelism, but you could stream out a section at a time (at least for the compressed sections) and then seek back to write the sh* offset fields based on how the compression actually worked out.

I guess for Split DWARF the memory savings wouldn't be that significant, though? Do you have a sense of how much memory it'd take.

The saving is still large because of .debug_line.

Here is a -DCMAKE_BUILD_TYPE=Debug -DLLVM_TARGETS_TO_BUILD=X86 -DCMAKE_CXX_FLAGS='-gdwarf-5 -gsplit-dwarf' build of Clang.

% ~/projects/bloaty/Release/bloaty lld
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  38.0%   368Mi   0.0%       0    .debug_gnu_pubnames
  13.3%   129Mi  62.0%   129Mi    .text
  12.7%   123Mi   0.0%       0    .debug_line
  11.5%   111Mi   0.0%       0    .debug_gnu_pubtypes
  10.9%   105Mi   0.0%       0    .strtab
   2.8%  27.3Mi  13.1%  27.3Mi    .eh_frame
   2.4%  22.9Mi  11.0%  22.9Mi    .rodata
   2.2%  21.6Mi   0.0%       0    .debug_addr
   2.2%  21.0Mi   0.0%       0    .symtab
   1.3%  12.3Mi   5.9%  12.3Mi    .dynstr
   1.0%  9.37Mi   0.0%       0    .debug_rnglists
   0.7%  6.83Mi   3.3%  6.83Mi    .eh_frame_hdr
   0.4%  4.15Mi   2.0%  4.15Mi    .data.rel.ro
   0.3%  3.06Mi   1.5%  3.06Mi    .dynsym
   0.1%  1.02Mi   0.5%  1.02Mi    .hash
   0.1%   995Ki   0.0%       0    .debug_info
   0.1%   907Ki   0.4%   907Ki    .gnu.hash
   0.1%   558Ki   0.1%   249Ki    [24 Others]
   0.0%   364Ki   0.0%       0    .debug_str
   0.0%       0   0.2%   363Ki    .bss
   0.0%   261Ki   0.1%   261Ki    .gnu.version
 100.0%   970Mi 100.0%   208Mi    TOTAL

With --compress-debug-sections=zlib but not --gdb-index (so the huge not-so-useful .debug_gnu_pubnames is compressed)

% hyperfine --warmup 2 --min-runs 10 "numactl -C 20-27 "{/tmp/c/0,/tmp/c/1}" -flavor gnu @response.txt --threads=8 -o lld --compress-debug-sections=zlib"
Benchmark 1: numactl -C 20-27 /tmp/c/0 -flavor gnu @response.txt --threads=8 -o lld --compress-debug-sections=zlib
  Time (mean ± σ):     10.756 s ±  0.025 s    [User: 10.797 s, System: 1.852 s]
  Range (min … max):   10.712 s … 10.791 s    10 runs
 
Benchmark 2: numactl -C 20-27 /tmp/c/1 -flavor gnu @response.txt --threads=8 -o lld --compress-debug-sections=zlib
  Time (mean ± σ):      5.487 s ±  0.047 s    [User: 10.964 s, System: 1.830 s]
  Range (min … max):    5.403 s …  5.559 s    10 runs
 
Summary
  'numactl -C 20-27 /tmp/c/1 -flavor gnu @response.txt --threads=8 -o lld --compress-debug-sections=zlib' ran
    1.96 ± 0.02 times faster than 'numactl -C 20-27 /tmp/c/0 -flavor gnu @response.txt --threads=8 -o lld --compress-debug-sections=zlib'

With --gdb-index

% hyperfine --warmup 2 --min-runs 10 "numactl -C 20-27 "{/tmp/c/0,/tmp/c/1}" -flavor gnu @response.txt --threads=8 -o lld --compress-debug-sections=zlib --gdb-index"
Benchmark 1: numactl -C 20-27 /tmp/c/0 -flavor gnu @response.txt --threads=8 -o lld --compress-debug-sections=zlib --gdb-index
  Time (mean ± σ):      6.981 s ±  0.020 s    [User: 9.516 s, System: 1.979 s]
  Range (min … max):    6.945 s …  7.015 s    10 runs
 
Benchmark 2: numactl -C 20-27 /tmp/c/1 -flavor gnu @response.txt --threads=8 -o lld --compress-debug-sections=zlib --gdb-index
  Time (mean ± σ):      5.350 s ±  0.037 s    [User: 9.623 s, System: 1.935 s]
  Range (min … max):    5.293 s …  5.399 s    10 runs
 
Summary
  'numactl -C 20-27 /tmp/c/1 -flavor gnu @response.txt --threads=8 -o lld --compress-debug-sections=zlib --gdb-index' ran
    1.30 ± 0.01 times faster than 'numactl -C 20-27 /tmp/c/0 -flavor gnu @response.txt --threads=8 -o lld --compress-debug-sections=zlib --gdb-index'

Another direction to go could be to do compressed data concatenation - if the compression algorithm supports concatenation, you could lose some size benefits and gain speed (like lld's sliding scale of string deduplication) by just concatenating the compressed sections together - predictable size and you could write the updated compressed section header based on the input sections headers.

The concatenation approach is what is used here :)

Though I guess most of the DWARF sections remaining in the objects/linked binary when using Split DWARF require relocations to be applied, so that requires decompressing/recompressing anyway... :/

The end of https://maskray.me/blog/2022-01-23-compressed-debug-sections#linkers discusses why not allocating a buffer is tricky and is not generic enough.
Updating section headers afterwards has the issue that the output file size is unknown up front, so we cannot mmap the output in a read-write way.

In D117853#3268012, @MaskRay wrote:

The saving is still large because of .debug_line.

I mostly meant the memory savings that might be available if we could avoid caching compressed debug info output sections - I guess looking at the numbers you posted, assuming lld's internal data structures don't use much memory compared to the output size & assuming you're writing to tmpfs so the output counts as memory usage - that's still like half the output file size again as memory usage for compressed output section buffers, so a possible 30% reduction in memory usage or so... which seems pretty valuable, but hard to achieve for sure.

Another direction to go could be to do compressed data concatenation - if the compression algorithm supports concatenation, you could lose some size benefits and gain speed (like lld's sliding scale of string deduplication) by just concatenating the compressed sections together - predictable size and you could write the updated compressed section header based on the input sections headers.

The concatenation approach is what is used here :)

Ah, sorry, I meant concatenation of the input sections - no need to decompress or recompress, but that only applies if there are no relocations or other changes to apply to the data.

Though I guess most of the DWARF sections remaining in the objects/linked binary when using Split DWARF require relocations to be applied, so that requires decompressing/recompressing anyway... :/

The end of https://maskray.me/blog/2022-01-23-compressed-debug-sections#linkers discusses why not allocating a buffer is tricky and is not generic enough.
Updating section headers afterwards has an issue that the output file size is unknown so cannot mmap the output in a read-write way.

Ah - I think gold's dwp does it by using a pwrite stream instead - streaming out the section contents and then seeking back to modify the header, rather than memory mapped copies. Not sure what the performance tradeoffs are like for that & whether you could then go back after streaming out the compressed data - and then I guess maybe reopening as memory mapped to write out the rest of the contents.

In D117853#3268030, @dblaikie wrote:

I mostly meant the memory savings that might be available if we could avoid caching compressed debug info output sections - I guess looking at the numbers you posted, assuming lld's internal data structures don't use much memory compared to the output size & assuming you're writing to tmpfs so the output counts as memory usage - that's still like half the output file size again as memory usage for compressed output section buffers, so a possible 30% reduction in memory usage or so... which seems pretty valuable, but hard to achieve for sure.

There will be some memory savings but I am speculating that they are small.
My rationale is that zlib::compress allocates a compressed buffer whose size is a bit larger than the input size (zlib deflateBound).
(This is actually a saving many projects do not realize (jdk, ffmpeg, etc).)
This patch switches the initial buffer size to half the input size by default, but I see only a very small memory usage decrease (I don't remember clearly, but definitely less than 2%).
So I speculate that even if I drop the output buffer entirely, the saving won't be large.
The likely reason is that the memory just overlaps some data structures allocated by previous passes.
I haven't used a heap profiler to look into it more deeply.

Ah, sorry, I meant concatenation of the input sections - no need to decompress or recompress, but that only applies if there are no relocations or other changes to apply to the data.

Oh, you mean compressing input sections individually and then concatenating them.
I've thought about this.
One big issue is that initializing zlib data structures takes time.
If we create one z_stream for every input section, the overhead may be too high.
See https://zlib.net/zlib_tech.html "Memory Footprint"; the time complexity is comparable with the memory footprint.
Maybe someone interested can do the experiments.
My bet is that even if it may have some memory usage benefit, the CPU overhead may be too large (I will not be surprised if it is even slower than the status quo).

In D117853#3268050, @MaskRay wrote:

There will be some memory savings but I am speculating that it is small.
My rationale is that zlib::compress allocates a compressed buffer whose size is a bit larger than the input size (zlib deflateBound).
(This is actually a saving many projects do not realize (jdk,ffmpeg,etc))
This patch switches to half by default but I see a very small memory usage decrease (I don't remember clearly, but definitely less than 2%).
So I speculate that even if I drop the output buffer entirely, the saving won't be large.
The likely reason is that the memory just overlaps some data structures allocated by previous passes.
I haven't use a heap profiler to look into it more deeply.

Yeah, might be interesting to know where peak linker memory usage is - if this isn't at the peak point, that's fair - less to worry about.

Oh, you mean compressing input sections individually and then concatenating them.
I've thought about this.
One big issue is that initializing zlib data structures takes time.
If we create one z_stream for every input section, the overhead may be too high.

Ah, sorry, no, I meant taking the already-compressed input sections and writing them straight to the output without the linker ever decompressing or compressing this data. Which, yeah, only applies if there are no relocations to apply - which is more relevant with dwp (where I mostly have in mind) than with lld (if you're using Split DWARF - if you're not using Split DWARF but you are using DWARFv5, there might be more opportunities for DWARF sections that have no relocations), though some sections even with Split DWARF have no relocations, like .debug_rnglists for instance.

In D117853#3268063, @dblaikie wrote:

Yeah, might be interesting to know where peak linker memory usage is - if this isn't at the peak point, that's fair - less to worry about.

Ah, sorry, no, I meant taking the already-compressed input sections and writing them straight to the output without the linker ever decompressing or compressing this data. Which, yeah, only applies if there are no relocations to apply - which is more relevant with dwp (where I mostly have in mind) than with lld (if you're using Split DWARF - if you're not using Split DWARF but you are using DWARFv5, there might be more opportunities for DWARF sections that have no relocations), though some sections even with Split DWARF have no relocations, like .debug_rnglists for instance.

OK, got it :) Strip the zlib header and the trailer of a compressed input section and concatenate the data part.
This is what https://github.com/madler/zlib/blob/master/examples/gzjoin.c#L34 does. It does not re-compress the output but needs to uncompress the input to find the final block marker (BFINAL).
The implementation is a bit involved; moreover, the compressed data may not be retained (see D52917 for data()). --gdb-index needs to uncompress .debug_info.
If we want to leverage this optimization (the output will be larger because the default 32KiB window size is essentially shrunk to the input section size), quite involved changes would be needed.
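For reference, the "strip the header and the trailer" step by itself is small. The sketch below follows the zlib stream layout from RFC 1950 (2-byte header, raw DEFLATE data, 4-byte big-endian Adler-32) and operates on the bytes that come after the ELF compression header. `ZlibPayload` and `stripZlibFraming` are hypothetical names. The sketch deliberately does not handle the hard part described above: the payload's last block still has BFINAL set, so payloads cannot simply be concatenated without locating and clearing that bit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Raw DEFLATE payload of a zlib stream plus the checksum from its trailer.
struct ZlibPayload {
  const uint8_t *data; // first byte after the 2-byte zlib header
  size_t size;         // payload length, excluding header and trailer
  uint32_t adler32;    // Adler-32 stored big-endian in the last 4 bytes
};

// Returns false if `buf` does not look like a zlib stream that this sketch
// can handle (too short, wrong method, bad header check, preset dictionary).
bool stripZlibFraming(const uint8_t *buf, size_t len, ZlibPayload &out) {
  if (len < 6)
    return false; // need 2 header bytes and 4 trailer bytes
  uint8_t cmf = buf[0], flg = buf[1];
  if ((cmf & 0x0f) != 8)
    return false; // CM must be 8 (deflate)
  if ((unsigned(cmf) * 256 + flg) % 31 != 0)
    return false; // FCHECK makes the 16-bit header a multiple of 31
  if (flg & 0x20)
    return false; // FDICT set: a DICTID follows, not handled here
  const uint8_t *t = buf + len - 4;
  out.data = buf + 2;
  out.size = len - 6;
  out.adler32 = (uint32_t(t[0]) << 24) | (uint32_t(t[1]) << 16) |
                (uint32_t(t[2]) << 8) | uint32_t(t[3]);
  return true;
}
```

The stored Adler-32 values of the inputs could be merged with zlib's `adler32_combine`, which needs each input's uncompressed length; that length is available from the ch_size field of the input's compression header.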

The feedback is positive. I'll push this tomorrow.

MaskRay edited the summary of this revision. Jan 25 2022, 10:24 AM

Closed by commit rG4cdc4416903b: [ELF] Parallelize --compress-debug-sections=zlib (authored by MaskRay). Jan 25 2022, 10:29 AM

This revision was automatically updated to reflect the committed changes.

MaskRay added a commit: rG4cdc4416903b: [ELF] Parallelize --compress-debug-sections=zlib.

mgorny reopened this revision. Feb 6 2022, 5:10 AM

mgorny added inline comments.

lld/ELF/OutputSections.cpp
18	This breaks the build against installed LLVM since `config.h` is a private header. I guess you're looking to add a new constant to `llvm-config.h`.

This revision is now accepted and ready to land. Feb 6 2022, 5:10 AM

mgorny requested changes to this revision. Feb 6 2022, 5:10 AM

This revision now requires changes to proceed. Feb 6 2022, 5:10 AM

The llvm-config.h thing is discussed in D119058.

I don't think the standalone build is officially supported (it has been removed for some projects), and it does not work anyway due to GetErrcMessages and a libunwind header issue, so if it is going to be problematic we may have to accept the compromise.

MaskRay removed a reviewer: mgorny. Feb 7 2022, 2:01 PM

This revision is now accepted and ready to land. Feb 7 2022, 2:01 PM

MaskRay added a reviewer: mgorny. Feb 7 2022, 2:02 PM

MaskRay closed this revision. Feb 7 2022, 4:18 PM

MaskRay mentioned this in D121512: [Support] Change zlib::compress to return void. Mar 11 2022, 9:46 PM

MaskRay mentioned this in D128667: [WIP] Add Zstd ELF support. Jun 27 2022, 10:43 AM

MaskRay mentioned this in D133548: [ELF] Add --compress-debug-sections=zstd. Sep 8 2022, 6:50 PM

MaskRay mentioned this in rG449f2ca146dd: [ELF] Add --compress-debug-sections=zstd. Sep 9 2022, 10:30 AM

MaskRay mentioned this in D133679: [ELF] Parallelize --compress-debug-sections=zstd. Sep 11 2022, 9:39 PM

MaskRay mentioned this in rGfa74144c64df: [ELF] Parallelize --compress-debug-sections=zstd. Sep 21 2022, 11:13 AM

Revision Contents

Path                          Size
lld/ELF/CMakeLists.txt        5 lines
lld/ELF/OutputSections.h      8 lines
lld/ELF/OutputSections.cpp    100 lines

Diff 402095

lld/ELF/CMakeLists.txt

 set(LLVM_TARGET_DEFINITIONS Options.td)
 tablegen(LLVM Options.inc -gen-opt-parser-defs)
 add_public_tablegen_target(ELFOptionsTableGen)

+if(LLVM_ENABLE_ZLIB)
+  set(imported_libs ZLIB::ZLIB)
+endif()
+
 add_lld_library(lldELF
   AArch64ErrataFix.cpp
   Arch/AArch64.cpp
   Arch/AMDGPU.cpp
   Arch/ARM.cpp
   Arch/AVR.cpp
   Arch/Hexagon.cpp
   Arch/Mips.cpp
 [40 unchanged lines not shown]
   MC
   Object
   Option
   Passes
   Support

   LINK_LIBS
   lldCommon
+  ${imported_libs}
   ${LLVM_PTHREAD_LIB}

   DEPENDS
   ELFOptionsTableGen
   intrinsics_gen
   )

lld/ELF/OutputSections.h

Show All 19 Lines

namespace lld {		namespace lld {
namespace elf {		namespace elf {

struct PhdrEntry;		struct PhdrEntry;
class InputSection;		class InputSection;
class InputSectionBase;		class InputSectionBase;

		struct CompressedData {
		std::unique_ptr<SmallVector<uint8_t, 0>[]> shards;
		uint32_t numShards;
		uint32_t checksum;
		};

// This represents a section in an output file.		// This represents a section in an output file.
// It is composed of multiple InputSections.		// It is composed of multiple InputSections.
// The writer creates multiple OutputSections and assign them unique,		// The writer creates multiple OutputSections and assign them unique,
// non-overlapping file offsets and VAs.		// non-overlapping file offsets and VAs.
class OutputSection final : public SectionCommand, public SectionBase {		class OutputSection final : public SectionCommand, public SectionBase {
public:		public:
OutputSection(StringRef name, uint32_t type, uint64_t flags);		OutputSection(StringRef name, uint32_t type, uint64_t flags);

▲ Show 20 Lines • Show All 72 Lines • ▼ Show 20 Lines	public:

void sort(llvm::function_ref<int(InputSectionBase *s)> order);		void sort(llvm::function_ref<int(InputSectionBase *s)> order);
void sortInitFini();		void sortInitFini();
void sortCtorsDtors();		void sortCtorsDtors();

private:		private:
// Used for implementation of --compress-debug-sections option.		// Used for implementation of --compress-debug-sections option.
SmallVector<uint8_t, 0> zDebugHeader;		SmallVector<uint8_t, 0> zDebugHeader;
SmallVector<char, 0> compressedData;		CompressedData compressed;

std::array<uint8_t, 4> getFiller();		std::array<uint8_t, 4> getFiller();
};		};

int getPriority(StringRef s);		int getPriority(StringRef s);

InputSection *getFirstInputSection(const OutputSection *os);		InputSection *getFirstInputSection(const OutputSection *os);
SmallVector<InputSection *, 0> getInputSections(const OutputSection &os);		SmallVector<InputSection *, 0> getInputSections(const OutputSection &os);
Show All 21 Lines

lld/ELF/OutputSections.cpp

	Show All 9 Lines
	#include "Config.h"			#include "Config.h"
	#include "LinkerScript.h"			#include "LinkerScript.h"
	#include "SymbolTable.h"			#include "SymbolTable.h"
	#include "SyntheticSections.h"			#include "SyntheticSections.h"
	#include "Target.h"			#include "Target.h"
	#include "lld/Common/Memory.h"			#include "lld/Common/Memory.h"
	#include "lld/Common/Strings.h"			#include "lld/Common/Strings.h"
	#include "llvm/BinaryFormat/Dwarf.h"			#include "llvm/BinaryFormat/Dwarf.h"
	#include "llvm/Support/Compression.h"			#include "llvm/Config/config.h" // LLVM_ENABLE_ZLIB
				mgorny: This breaks the build against installed LLVM since `config.h` is a private header. I guess you're looking to add a new constant to `llvm-config.h`.
	#include "llvm/Support/MD5.h"			#include "llvm/Support/MD5.h"
	#include "llvm/Support/MathExtras.h"			#include "llvm/Support/MathExtras.h"
	#include "llvm/Support/Parallel.h"			#include "llvm/Support/Parallel.h"
	#include "llvm/Support/SHA1.h"			#include "llvm/Support/SHA1.h"
	#include "llvm/Support/TimeProfiler.h"			#include "llvm/Support/TimeProfiler.h"
	#include <regex>			#include <regex>
	#include <unordered_set>			#include <unordered_set>
				#if LLVM_ENABLE_ZLIB
				#include <zlib.h>
				#endif

	using namespace llvm;			using namespace llvm;
	using namespace llvm::dwarf;			using namespace llvm::dwarf;
	using namespace llvm::object;			using namespace llvm::object;
	using namespace llvm::support::endian;			using namespace llvm::support::endian;
	using namespace llvm::ELF;			using namespace llvm::ELF;
	using namespace lld;			using namespace lld;
	using namespace lld::elf;			using namespace lld::elf;
	▲ Show 20 Lines • Show All 245 Lines • ▼ Show 20 Lines
	static void fill(uint8_t *buf, size_t size,			static void fill(uint8_t *buf, size_t size,
	const std::array<uint8_t, 4> &filler) {			const std::array<uint8_t, 4> &filler) {
	size_t i = 0;			size_t i = 0;
	for (; i + 4 < size; i += 4)			for (; i + 4 < size; i += 4)
	memcpy(buf + i, filler.data(), 4);			memcpy(buf + i, filler.data(), 4);
	memcpy(buf + i, filler.data(), size - i);			memcpy(buf + i, filler.data(), size - i);
	}			}

				#if LLVM_ENABLE_ZLIB
				static SmallVector<uint8_t, 0> deflateShard(ArrayRef<uint8_t> in, int level,
				alexander-shaposhnikov: I'm wondering if you have considered using llvm/Support/Compression.h (the implementation there appears to contain some bits to make it msan-friendly + error handling, but I'm not closely familiar with that code).
				MaskRay (author): The code is largely lld/ELF specific. If I add the code to llvm/Support/Compression.h, LLVMSupport will get bloated. Technically llvm-objcopy --compress-debug-sections can use the code as well, but the two projects may have different tweaks and sharing code won't help much in my opinion.
				alexander-shaposhnikov: Just in case: after looking at https://zlib.net/manual.html and https://llvm.org/doxygen/Compression_8cpp_source.html, the return values of `deflateInit2`, `deflate` or `compress2` are not ignored there. P.S. Compression.h contains wrappers around compress2, but what's going on here is a bit different (compression of chunks + no headers), so, yeah, it answers my question above.
				bool isLast) {
				z_stream s = {};
				s.avail_in = in.size();
				s.next_in = const_cast<uint8_t *>(in.data());

				// 15 and 8 are default. windowBits=-15 is negative to generate raw deflate
				// data with no zlib header or trailer.
				deflateInit2(&s, level, Z_DEFLATED, -15, 8, Z_DEFAULT_STRATEGY);

				// Allocate a buffer of half of the input size, and grow it by 1.5x if
				peter.smith: Typo // Allocate a buffer
				// insufficient.
				SmallVector<uint8_t, 0> out;
				out.resize_for_overwrite(std::max<size_t>(in.size() / 2, 128));
				s.avail_out = out.size();
				s.next_out = out.data();
				for (;;) {
				deflate(&s, isLast ? Z_FINISH : Z_SYNC_FLUSH);
				if (s.avail_out != 0)
				break;
				size_t pos = s.next_out - out.data();
				size_t oldSize = out.size();
				out.resize_for_overwrite(oldSize * 3 / 2);
				s.avail_out += out.size() - oldSize;
				s.next_out = out.data() + pos;
				}
				assert(s.avail_in == 0);

				out.truncate(out.size() - s.avail_out);
				deflateEnd(&s);
				return out;
				}
				#endif

	// Compress section contents if this section contains debug info.			// Compress section contents if this section contains debug info.
	template <class ELFT> void OutputSection::maybeCompress() {			template <class ELFT> void OutputSection::maybeCompress() {
				#if LLVM_ENABLE_ZLIB
	using Elf_Chdr = typename ELFT::Chdr;			using Elf_Chdr = typename ELFT::Chdr;

	// Compress only DWARF debug sections.			// Compress only DWARF debug sections.
	if (!config->compressDebugSections || (flags & SHF_ALLOC) ||			if (!config->compressDebugSections || (flags & SHF_ALLOC) ||
	!name.startswith(".debug_"))			!name.startswith(".debug_") || size == 0)
	return;			return;

	llvm::TimeTraceScope timeScope("Compress debug sections");			llvm::TimeTraceScope timeScope("Compress debug sections");

	// Create a section header.			// Create a section header.
	zDebugHeader.resize(sizeof(Elf_Chdr));			zDebugHeader.resize(sizeof(Elf_Chdr));
	auto *hdr = reinterpret_cast<Elf_Chdr *>(zDebugHeader.data());			auto *hdr = reinterpret_cast<Elf_Chdr *>(zDebugHeader.data());
	hdr->ch_type = ELFCOMPRESS_ZLIB;			hdr->ch_type = ELFCOMPRESS_ZLIB;
	hdr->ch_size = size;			hdr->ch_size = size;
	hdr->ch_addralign = alignment;			hdr->ch_addralign = alignment;

	// Write section contents to a temporary buffer and compress it.			// Write section contents to a temporary buffer and compress it.
	std::vector<uint8_t> buf(size);			std::vector<uint8_t> buf(size);
				MaskRay (author): This zero fills the buffer, but I have tested that removing it and adding gap filling in `writeTo` does not improve performance.
	writeTo<ELFT>(buf.data());			writeTo<ELFT>(buf.data());
	// We chose 1 as the default compression level because it is the fastest. If			// We chose 1 as the default compression level because it is the fastest. If
				ikudrin: Maybe mention `Z_BEST_SPEED` instead of just `1`?
	// -O2 is given, we use level 6 to compress debug info more by ~15%. We found			// -O2 is given, we use level 6 to compress debug info more by ~15%. We found
	// that level 7 to 9 doesn't make much difference (~1% more compression) while			// that level 7 to 9 doesn't make much difference (~1% more compression) while
	// they take significant amount of time (~2x), so level 6 seems enough.			// they take significant amount of time (~2x), so level 6 seems enough.
	if (Error e = zlib::compress(toStringRef(buf), compressedData,			const int level = config->optimize >= 2 ? 6 : Z_BEST_SPEED;
	config->optimize >= 2 ? 6 : 1))
	fatal("compress failed: " + llvm::toString(std::move(e)));			// Split input into 2-MiB shards.
				constexpr size_t shardSize = 2 << 20;
				const size_t numShards = (size + shardSize - 1) / shardSize;
				auto shardsIn = std::make_unique<ArrayRef<uint8_t>[]>(numShards);
				peter.smith: Is it worth picking a plural as there can be more than one shard? Similarly for out and adler. For example ins, outs and adlers. I'm not sure ins and outs sound right though, perhaps shardsIn and shardsOut. Again not a strong opinion.
				for (size_t i = 0, start = 0, end; start != buf.size(); ++i, start = end) {
				peter.smith: Might be worth using start and end rather than i and j? I've not got a strong opinion here, happy to keep with i, j if you prefer.
				end = std::min(start + shardSize, buf.size());
				shardsIn[i] = makeArrayRef<uint8_t>(buf.data() + start, end - start);
				}

				// Compress shards and compute Adler-32 checksums.
				auto shardsOut = std::make_unique<SmallVector<uint8_t, 0>[]>(numShards);
				auto shardsAdler = std::make_unique<uint32_t[]>(numShards);
				parallelForEachN(0, numShards, [&](size_t i) {
				peter.smith: The code above uses idx for going through in[] and i for something else; could be worth using the same value?
				shardsOut[i] = deflateShard(shardsIn[i], level, i == numShards - 1);
				shardsAdler[i] = adler32(1, shardsIn[i].data(), shardsIn[i].size());
				});

	// Update section headers.			// Update section size and combine Adler-32 checksums.
	size = sizeof(Elf_Chdr) + compressedData.size();			uint32_t checksum = 1; // Initial Adler-32 value
				size = sizeof(Elf_Chdr) + 2; // Elf_Chdr and zlib header
				for (size_t i = 0; i != numShards; ++i) {
				size += shardsOut[i].size();
				checksum = adler32_combine(checksum, shardsAdler[i], shardsIn[i].size());
				}
				size += 4; // checksum

				compressed.shards = std::move(shardsOut);
				compressed.numShards = numShards;
				compressed.checksum = checksum;
	flags |= SHF_COMPRESSED;			flags |= SHF_COMPRESSED;
				#endif
	}			}

	static void writeInt(uint8_t *buf, uint64_t data, uint64_t size) {			static void writeInt(uint8_t *buf, uint64_t data, uint64_t size) {
	if (size == 1)			if (size == 1)
	*buf = data;			*buf = data;
	else if (size == 2)			else if (size == 2)
	write16(buf, data);			write16(buf, data);
	else if (size == 4)			else if (size == 4)
	write32(buf, data);			write32(buf, data);
	else if (size == 8)			else if (size == 8)
	write64(buf, data);			write64(buf, data);
	else			else
	llvm_unreachable("unsupported Size argument");			llvm_unreachable("unsupported Size argument");
	}			}

	template <class ELFT> void OutputSection::writeTo(uint8_t *buf) {			template <class ELFT> void OutputSection::writeTo(uint8_t *buf) {
	llvm::TimeTraceScope timeScope("Write sections", name);			llvm::TimeTraceScope timeScope("Write sections", name);
	if (type == SHT_NOBITS)			if (type == SHT_NOBITS)
	return;			return;

	// If --compress-debug-section is specified and if this is a debug section,			// If --compress-debug-section is specified and if this is a debug section,
	// we've already compressed section contents. If that's the case,			// we've already compressed section contents. If that's the case,
	// just write it down.			// just write it down.
	if (!compressedData.empty()) {			if (compressed.shards) {
	memcpy(buf, zDebugHeader.data(), zDebugHeader.size());			memcpy(buf, zDebugHeader.data(), zDebugHeader.size());
	memcpy(buf + zDebugHeader.size(), compressedData.data(),			buf += zDebugHeader.size();
	compressedData.size());			size -= zDebugHeader.size();

				// Compute shard offsets.
				auto offsets = std::make_unique<size_t[]>(compressed.numShards);
				offsets[0] = 2; // zlib header
				for (size_t i = 1; i != compressed.numShards; ++i)
				offsets[i] = offsets[i - 1] + compressed.shards[i - 1].size();

				buf[0] = 0x78; // CMF
				buf[1] = 0x01; // FLG: best speed
				parallelForEachN(0, compressed.numShards, [&](size_t i) {
				memcpy(buf + offsets[i], compressed.shards[i].data(),
				compressed.shards[i].size());
				});

				write32be(buf + size - 4, compressed.checksum);
	return;			return;
	}			}

	// Write leading padding.			// Write leading padding.
	SmallVector<InputSection *, 0> sections = getInputSections(*this);			SmallVector<InputSection *, 0> sections = getInputSections(*this);
	std::array<uint8_t, 4> filler = getFiller();			std::array<uint8_t, 4> filler = getFiller();
	bool nonZeroFiller = read32(filler.data()) != 0;			bool nonZeroFiller = read32(filler.data()) != 0;
	if (nonZeroFiller)			if (nonZeroFiller)
	▲ Show 20 Lines • Show All 245 Lines • Show Last 20 Lines
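
Editorial sketch (not part of the patch): the checksum handling in maybeCompress above relies on the fact that Adler-32 values computed independently per shard can be folded into the checksum of the whole buffer, which is what zlib's adler32_combine provides. The following is a minimal, zlib-free illustration of that property. The names adler32Sketch, adler32CombineSketch and shardedAdler32 are hypothetical stand-ins rather than zlib or lld functions, and the loop runs serially where the patch uses parallelForEachN.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Largest prime below 2^16; the Adler-32 modulus from RFC 1950.
constexpr uint32_t kAdlerBase = 65521;

// Plain Adler-32 that continues from `adler` (pass 1 to start a new checksum).
// Hypothetical stand-in for zlib's adler32().
uint32_t adler32Sketch(uint32_t adler, const uint8_t *data, size_t len) {
  uint32_t a = adler & 0xffff, b = adler >> 16;
  for (size_t i = 0; i < len; ++i) {
    a = (a + data[i]) % kAdlerBase;
    b = (b + a) % kAdlerBase;
  }
  return (b << 16) | a;
}

// Combines the checksum of a prefix (adler1) with the checksum of the next
// len2 bytes (adler2), where adler2 was computed from the initial value 1.
// Hypothetical stand-in for zlib's adler32_combine().
uint32_t adler32CombineSketch(uint32_t adler1, uint32_t adler2, uint64_t len2) {
  uint64_t a1 = adler1 & 0xffff, b1 = adler1 >> 16;
  uint64_t a2 = adler2 & 0xffff, b2 = adler2 >> 16;
  // a = a1 + a2 - 1 and b = b1 + b2 + len2 * (a1 - 1), all modulo kAdlerBase.
  uint64_t a = (a1 + a2 + kAdlerBase - 1) % kAdlerBase;
  uint64_t b =
      (b1 + b2 + (len2 % kAdlerBase) * ((a1 + kAdlerBase - 1) % kAdlerBase)) %
      kAdlerBase;
  return static_cast<uint32_t>((b << 16) | a);
}

// Mirrors the shape of the loop in maybeCompress: checksum every shard
// independently, then fold the per-shard values into one checksum.
uint32_t shardedAdler32(const std::vector<uint8_t> &buf, size_t shardSize) {
  assert(shardSize != 0 && "shard size must be positive");
  uint32_t checksum = 1; // Initial Adler-32 value
  for (size_t start = 0; start < buf.size(); start += shardSize) {
    size_t len = std::min(shardSize, buf.size() - start);
    uint32_t shard = adler32Sketch(1, buf.data() + start, len);
    checksum = adler32CombineSketch(checksum, shard, len);
  }
  return checksum;
}
```

Under these assumptions, shardedAdler32(buf, n) should equal adler32Sketch(1, buf.data(), buf.size()) for any positive shard size n, which is why the shards can be checksummed in parallel without changing the trailer written by writeTo.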