This is an archive of the discontinued LLVM Phabricator instance.

[ELF] Parallelize --compress-debug-sections=zstd
ClosedPublic

Authored by MaskRay on Sep 11 2022, 5:08 PM.

Details

Summary

See D117853: compressing debug sections is a bottleneck and therefore it
has a large value parallizing the step.

zstd provides multi-threading API and the output is deterministic even with
different numbers of threads (see https://github.com/facebook/zstd/issues/2238).
Therefore we can leverage it instead of using the pigz-style sharding approach.

Also, switch to the default compression level 3. The current level 5
is significantly slower without providing justifying size benefit.

  'dash b.sh 1' ran
    1.05 ± 0.01 times faster than 'dash b.sh 3'
    1.18 ± 0.01 times faster than 'dash b.sh 4'
    1.29 ± 0.02 times faster than 'dash b.sh 5'

level=1 size: 358946945
level=3 size: 309002145
level=4 size: 307693204
level=5 size: 297828315

Diff Detail

Event Timeline

MaskRay created this revision.Sep 11 2022, 5:08 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 11 2022, 5:08 PM
MaskRay added a comment.EditedSep 11 2022, 5:13 PM

This patch is derived from the following zstd parallelism experiment:

# Build zstd with cmake
git clone https://github.com/facebook/zstd
cd zstd
cmake -GNinja -Hbuild/cmake -Bout/release -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/tmp/opt/zstd
make -j 8 install
% g++ -O2 -g z.cc -L/tmp/p/zstd/out/release/lib -lzstd -o z
% time ./z debug_info debug_info.zstd 1
./z debug_info debug_info.zstd 1  3.30s user 0.75s system 113% cpu 3.574 total
% time ./z debug_info debug_info.zstd 2
./z debug_info debug_info.zstd 2  3.39s user 0.71s system 182% cpu 2.239 total
% time ./z debug_info debug_info.zstd 4
./z debug_info debug_info.zstd 4  3.47s user 0.63s system 267% cpu 1.533 total
% time ./z debug_info debug_info.zstd 8
./z debug_info debug_info.zstd 8  3.76s user 0.66s system 349% cpu 1.263 total

The cli program is significantly faster. I do not know whether it's the program is async reading or other feature I have missed. Filed https://github.com/llvm/llvm-project/issues/57685

% time /tmp/p/zstd/out/release/programs/zstd -fq -T1 debug_info
/tmp/p/zstd/out/release/programs/zstd -fq -T1 debug_info  2.98s user 0.51s system 126% cpu 2.767 total
% time /tmp/p/zstd/out/release/programs/zstd -fq -T2 debug_info
/tmp/p/zstd/out/release/programs/zstd -fq -T2 debug_info  3.02s user 0.52s system 235% cpu 1.501 total
% time /tmp/p/zstd/out/release/programs/zstd -fq -T4 debug_info
/tmp/p/zstd/out/release/programs/zstd -fq -T4 debug_info  3.02s user 0.51s system 435% cpu 0.811 total
#include <algorithm>
#include <vector>

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <zstd.h>

int main(int argc, char *argv[]) {
  int fdin = open(argv[1], O_RDONLY);
  if (fdin < 0) return 1;
  struct stat st;
  if (fstat(fdin, &st) < 0) return 1;
  void *in = mmap(0, st.st_size, PROT_READ, MAP_SHARED, fdin, 0);
  if (in == MAP_FAILED) return 1;
  int fdout = open(argv[2], O_RDWR);
  if (fdout < 0) return 1;
  int th = 0;
  if (argc > 3)
    th = atoi(argv[3]);

  std::vector<uint8_t> out;
  out.resize(64);
  size_t pos = 0;

  ZSTD_CCtx *cctx = ZSTD_createCCtx();
  if (!cctx)
    return 1;
  if (ZSTD_isError(ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, th)))
    return 2;
  ZSTD_outBuffer zob = {out.data(), out.size(), 0};
  auto directive = ZSTD_e_continue;
  do {
    size_t n = std::min(st.st_size-pos, (size_t)1<<20);
    if (n == st.st_size-pos)
      directive = ZSTD_e_end;
    ZSTD_inBuffer zib = { (char*)in+pos, n, 0 };
    size_t more = 1;
    while (zib.pos != zib.size || directive == ZSTD_e_end && more != 0) {
      if (zob.pos == zob.size) {
        out.resize(out.size() * 3 / 2);
        zob.dst = out.data();
        zob.size = out.size();
      }

      more = ZSTD_compressStream2(cctx, &zob, &zib, directive);
      if (ZSTD_isError(more)) {
        fprintf(stderr, "%s\n", ZSTD_getErrorName(more));
        return 3;
      }
    }
    pos += n;
  } while (directive != ZSTD_e_end);

  out.resize(zob.pos);
  ftruncate(fdout, out.size());

  void *mout = mmap(0, out.size(), PROT_READ | PROT_WRITE, MAP_SHARED, fdout, 0);
  memcpy(mout, out.data(), out.size());
  munmap(mout, out.size());
  close(fdout);

  ZSTD_freeCCtx(cctx);
}
MaskRay added a comment.EditedSep 11 2022, 5:19 PM
git clone https://github.com/llvm/llvm-project.git --depth=1
cd llvm-project
curl -L 'https://reviews.llvm.org/D133679?download=1' | patch -p1

# Build lld. See https://llvm.org/docs/GettingStarted.html
cmake -GNinja -Sllvm -B/tmp/out/custom1 -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_PROJECTS=lld -DLLVM_ENABLE_ZSTD=FORCE_ON -DCMAKE_PREFIX_PATH=/tmp/opt/zstd -DLLVM_ENABLE_LLD=on
ninja -C /tmp/out/custom1 lld
No compression

% time /tmp/out/custom1/bin/ld.lld @response.txt -o a.out --threads=2
/tmp/out/custom1/bin/ld.lld @response.txt -o a.out --threads=2  9.89s user 2.92s system 151% cpu 8.477 total
% time /tmp/out/custom1/bin/ld.lld @response.txt -o a.out --threads=4
/tmp/out/custom1/bin/ld.lld @response.txt -o a.out --threads=4  10.82s user 3.08s system 209% cpu 6.640 total

zstd

% time /tmp/out/custom1/bin/ld.lld --compress-debug-sections=zstd @response.txt -o a.out --threads=1
/tmp/out/custom1/bin/ld.lld --compress-debug-sections=zstd @response.txt -o    14.19s user 3.10s system 104% cpu 16.532 total
% time /tmp/out/custom1/bin/ld.lld --compress-debug-sections=zstd @response.txt -o a.out --threads=2
/tmp/out/custom1/bin/ld.lld --compress-debug-sections=zstd @response.txt -o    15.16s user 3.83s system 162% cpu 11.657 total
% time /tmp/out/custom1/bin/ld.lld --compress-debug-sections=zstd @response.txt -o a.out --threads=4
/tmp/out/custom1/bin/ld.lld --compress-debug-sections=zstd @response.txt -o    16.73s user 3.77s system 219% cpu 9.323 total
% time /tmp/out/custom1/bin/ld.lld --compress-debug-sections=zstd @response.txt -o a.out --threads=8
/tmp/out/custom1/bin/ld.lld --compress-debug-sections=zstd @response.txt -o    18.97s user 4.04s system 280% cpu 8.194 total

zlib

% time /tmp/out/custom1/bin/ld.lld --compress-debug-sections=zlib @response.txt -o a.out --threads=2
/tmp/out/custom1/bin/ld.lld --compress-debug-sections=zlib @response.txt -o    23.68s user 3.02s system 168% cpu 15.805 total
% time /tmp/out/custom1/bin/ld.lld --compress-debug-sections=zlib @response.txt -o a.out --threads=4
/tmp/out/custom1/bin/ld.lld --compress-debug-sections=zlib @response.txt -o    24.55s user 3.43s system 253% cpu 11.036 total
MaskRay published this revision for review.Sep 11 2022, 5:37 PM
MaskRay retitled this revision from [WIP][ELF] Parallelize --compress-debug-sections=zstd to [ELF] Parallelize --compress-debug-sections=zstd.
MaskRay added reviewers: ikudrin, andrewng, peter.smith.
Herald added a project: Restricted Project. · View Herald TranscriptSep 11 2022, 5:37 PM
MaskRay edited the summary of this revision. (Show Details)Sep 11 2022, 9:39 PM

I've left some comments based on a first read of the code. The code looks reasonable to me. I'm not in a position to mention if this gives bad peformance on Windows (if there is a ZSTD consumer on that platform anyway) though.

lld/ELF/OutputSections.cpp
393

I think this is ZSTD's streaming compression? If I'm right could be worth saying something like:
Use ZSTD's streaming compression API which permits parallel workers working on the stream.
See http://facebook.github.io/zstd/zstd_manual.html Streaming compression -HowTo.

409

Could be worth making a const variable for 1 << 20 with a self descriptive name. Otherwise could be worth a comment on the choice of value.

413

more reads like it should be a boolean. Perhaps bytesRemaining?

421

Does ZSTD guarantee no error for the inputs that we are giving it? If we can't guarantee it then perhaps this should be a fatal error message.

MaskRay updated this revision to Diff 461746.Sep 20 2022, 4:02 PM
MaskRay marked 4 inline comments as done.

address comments

lld/ELF/OutputSections.cpp
393

Thanks for the comment. Adopted.

409

Switched to ZSTD_CStreamInSize() to avoid using a magic number. It's a soft recommendation (https://github.com/llvm/llvm-project/issues/57685#issuecomment-1244950193), though.

421

From the source code it's guaranteed if the usage is correct. The author uses an assert, too: https://github.com/llvm/llvm-project/issues/57685#issuecomment-1244008295

peter.smith accepted this revision.Sep 21 2022, 1:03 AM

Thanks for the update. Code changes look good to me and are localised to ZSTD. Will be worth leaving a little bit of time for objections from other reviewers.

This revision is now accepted and ready to land.Sep 21 2022, 1:03 AM
andrewng accepted this revision.Sep 21 2022, 7:40 AM

LGTM and performance benefit is good on Windows too.

This revision was landed with ongoing or failed builds.Sep 21 2022, 11:13 AM
This revision was automatically updated to reflect the committed changes.

This broke building with LLVM_LINK_LLVM_DYLIB, I went ahead and fixed it in 525a400c7ca5725b4ab456b222176f580caf35e7.