This is an archive of the discontinued LLVM Phabricator instance.

Differential D133679

[ELF] Parallelize --compress-debug-sections=zstd
ClosedPublic

Authored by MaskRay on Sep 11 2022, 5:08 PM.

Details

Reviewers

ikudrin
andrewng
peter.smith

Commits

rGfa74144c64df: [ELF] Parallelize --compress-debug-sections=zstd

Summary

See D117853: compressing debug sections is a bottleneck, so parallelizing
this step is highly valuable.

zstd provides a multi-threading API, and its output is deterministic even with
different numbers of threads (see https://github.com/facebook/zstd/issues/2238).
Therefore we can leverage it instead of using the pigz-style sharding approach.

Also, switch to the default compression level 3. The current level 5
is significantly slower without a size benefit that justifies the cost.

  'dash b.sh 1' ran
    1.05 ± 0.01 times faster than 'dash b.sh 3'
    1.18 ± 0.01 times faster than 'dash b.sh 4'
    1.29 ± 0.02 times faster than 'dash b.sh 5'

level=1 size: 358946945
level=3 size: 309002145
level=4 size: 307693204
level=5 size: 297828315
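
The trade-off behind the level choice can be worked out from the numbers above. The following is a minimal sketch, assuming the argument to `b.sh` is the compression level and that the timing ratios cover the whole benchmarked run rather than compression alone. The names `LevelResult`, `kResults`, `sizeDeltaPercent` and `timeDeltaPercent` are illustrative and do not appear in the patch.

```cpp
#include <cassert>
#include <cstdint>

// Measurements quoted in the summary. relativeTime is the ratio reported
// against level 1 (assumption: the argument of b.sh is the zstd level).
struct LevelResult {
  int level;
  double relativeTime;
  std::uint64_t size; // compressed size in bytes
};

constexpr LevelResult kResults[] = {
    {1, 1.00, 358946945},
    {3, 1.05, 309002145},
    {4, 1.18, 307693204},
    {5, 1.29, 297828315},
};

// Size change of `to` relative to `from`, in percent (negative = smaller).
inline double sizeDeltaPercent(const LevelResult &from, const LevelResult &to) {
  assert(from.size != 0);
  return (static_cast<double>(to.size) - static_cast<double>(from.size)) /
         static_cast<double>(from.size) * 100.0;
}

// Time change of `to` relative to `from`, in percent (positive = slower).
inline double timeDeltaPercent(const LevelResult &from, const LevelResult &to) {
  assert(from.relativeTime > 0);
  return (to.relativeTime / from.relativeTime - 1.0) * 100.0;
}
```

Comparing `kResults[1]` (level 3) with `kResults[3]` (level 5) gives an output that is roughly 3.6% smaller for about 23% more time, while going from level 1 to level 3 saves roughly 14% in size for about 5% more time.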

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

MaskRay created this revision. Sep 11 2022, 5:08 PM

Herald added a project: Restricted Project. Sep 11 2022, 5:08 PM

Herald added subscribers: StephenFan, arichardson, emaste.

This patch is derived from the following zstd parallelism experiment:

# Build zstd with cmake
git clone https://github.com/facebook/zstd
cd zstd
cmake -GNinja -Hbuild/cmake -Bout/release -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/tmp/opt/zstd
ninja -C out/release install

% g++ -O2 -g z.cc -L/tmp/p/zstd/out/release/lib -lzstd -o z
% time ./z debug_info debug_info.zstd 1
./z debug_info debug_info.zstd 1  3.30s user 0.75s system 113% cpu 3.574 total
% time ./z debug_info debug_info.zstd 2
./z debug_info debug_info.zstd 2  3.39s user 0.71s system 182% cpu 2.239 total
% time ./z debug_info debug_info.zstd 4
./z debug_info debug_info.zstd 4  3.47s user 0.63s system 267% cpu 1.533 total
% time ./z debug_info debug_info.zstd 8
./z debug_info debug_info.zstd 8  3.76s user 0.66s system 349% cpu 1.263 total

The CLI program is significantly faster. I do not know whether that is because the program reads its input asynchronously or because of some other feature I have missed. Filed https://github.com/llvm/llvm-project/issues/57685

% time /tmp/p/zstd/out/release/programs/zstd -fq -T1 debug_info
/tmp/p/zstd/out/release/programs/zstd -fq -T1 debug_info  2.98s user 0.51s system 126% cpu 2.767 total
% time /tmp/p/zstd/out/release/programs/zstd -fq -T2 debug_info
/tmp/p/zstd/out/release/programs/zstd -fq -T2 debug_info  3.02s user 0.52s system 235% cpu 1.501 total
% time /tmp/p/zstd/out/release/programs/zstd -fq -T4 debug_info
/tmp/p/zstd/out/release/programs/zstd -fq -T4 debug_info  3.02s user 0.51s system 435% cpu 0.811 total
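
To compare how the two sets of timings scale, the wall-clock totals can be turned into speedup and parallel-efficiency figures. This is a small sketch; `speedup` and `efficiency` are illustrative helper names, and the inputs are the `total` values printed by `time` above.

```cpp
#include <cassert>

// Speedup of a run with wall time `t` over the one-worker wall time `t1`.
inline double speedup(double t1, double t) {
  assert(t1 > 0 && t > 0);
  return t1 / t;
}

// Parallel efficiency: speedup divided by the number of workers.
// 1.0 would mean perfect linear scaling.
inline double efficiency(double t1, double t, int workers) {
  assert(workers > 0);
  return speedup(t1, t) / workers;
}
```

With four workers, `speedup(3.574, 1.533)` for the `z` test program is about 2.3, while `speedup(2.767, 0.811)` for the zstd CLI is about 3.4, which quantifies the gap noted above.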

#include <algorithm>
#include <vector>

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <zstd.h>

int main(int argc, char *argv[]) {
  // Usage: ./z <input> <output> [nbWorkers]
  if (argc < 3) return 1;
  int fdin = open(argv[1], O_RDONLY);
  if (fdin < 0) return 1;
  struct stat st;
  if (fstat(fdin, &st) < 0) return 1;
  void *in = mmap(0, st.st_size, PROT_READ, MAP_SHARED, fdin, 0);
  if (in == MAP_FAILED) return 1;
  // Create the output file if it does not exist yet.
  int fdout = open(argv[2], O_RDWR | O_CREAT, 0644);
  if (fdout < 0) return 1;
  int th = 0;
  if (argc > 3)
    th = atoi(argv[3]);

  std::vector<uint8_t> out;
  out.resize(64);
  size_t pos = 0;

  ZSTD_CCtx *cctx = ZSTD_createCCtx();
  if (!cctx)
    return 1;
  if (ZSTD_isError(ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, th)))
    return 2;
  ZSTD_outBuffer zob = {out.data(), out.size(), 0};
  auto directive = ZSTD_e_continue;
  do {
    size_t n = std::min(st.st_size-pos, (size_t)1<<20);
    if (n == st.st_size-pos)
      directive = ZSTD_e_end;
    ZSTD_inBuffer zib = { (char*)in+pos, n, 0 };
    size_t more = 1;
    while (zib.pos != zib.size || (directive == ZSTD_e_end && more != 0)) {
      if (zob.pos == zob.size) {
        out.resize(out.size() * 3 / 2);
        zob.dst = out.data();
        zob.size = out.size();
      }

      more = ZSTD_compressStream2(cctx, &zob, &zib, directive);
      if (ZSTD_isError(more)) {
        fprintf(stderr, "%s\n", ZSTD_getErrorName(more));
        return 3;
      }
    }
    pos += n;
  } while (directive != ZSTD_e_end);

  out.resize(zob.pos);
  ftruncate(fdout, out.size());

  void *mout = mmap(0, out.size(), PROT_READ | PROT_WRITE, MAP_SHARED, fdout, 0);
  memcpy(mout, out.data(), out.size());
  munmap(mout, out.size());
  close(fdout);

  ZSTD_freeCCtx(cctx);
}

git clone https://github.com/llvm/llvm-project.git --depth=1
cd llvm-project
curl -L 'https://reviews.llvm.org/D133679?download=1' | patch -p1

# Build lld. See https://llvm.org/docs/GettingStarted.html
cmake -GNinja -Sllvm -B/tmp/out/custom1 -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_PROJECTS=lld -DLLVM_ENABLE_ZSTD=FORCE_ON -DCMAKE_PREFIX_PATH=/tmp/opt/zstd -DLLVM_ENABLE_LLD=on
ninja -C /tmp/out/custom1 lld

No compression

% time /tmp/out/custom1/bin/ld.lld @response.txt -o a.out --threads=2
/tmp/out/custom1/bin/ld.lld @response.txt -o a.out --threads=2  9.89s user 2.92s system 151% cpu 8.477 total
% time /tmp/out/custom1/bin/ld.lld @response.txt -o a.out --threads=4
/tmp/out/custom1/bin/ld.lld @response.txt -o a.out --threads=4  10.82s user 3.08s system 209% cpu 6.640 total

zstd

% time /tmp/out/custom1/bin/ld.lld --compress-debug-sections=zstd @response.txt -o a.out --threads=1
/tmp/out/custom1/bin/ld.lld --compress-debug-sections=zstd @response.txt -o    14.19s user 3.10s system 104% cpu 16.532 total
% time /tmp/out/custom1/bin/ld.lld --compress-debug-sections=zstd @response.txt -o a.out --threads=2
/tmp/out/custom1/bin/ld.lld --compress-debug-sections=zstd @response.txt -o    15.16s user 3.83s system 162% cpu 11.657 total
% time /tmp/out/custom1/bin/ld.lld --compress-debug-sections=zstd @response.txt -o a.out --threads=4
/tmp/out/custom1/bin/ld.lld --compress-debug-sections=zstd @response.txt -o    16.73s user 3.77s system 219% cpu 9.323 total
% time /tmp/out/custom1/bin/ld.lld --compress-debug-sections=zstd @response.txt -o a.out --threads=8
/tmp/out/custom1/bin/ld.lld --compress-debug-sections=zstd @response.txt -o    18.97s user 4.04s system 280% cpu 8.194 total

zlib

% time /tmp/out/custom1/bin/ld.lld --compress-debug-sections=zlib @response.txt -o a.out --threads=2
/tmp/out/custom1/bin/ld.lld --compress-debug-sections=zlib @response.txt -o    23.68s user 3.02s system 168% cpu 15.805 total
% time /tmp/out/custom1/bin/ld.lld --compress-debug-sections=zlib @response.txt -o a.out --threads=4
/tmp/out/custom1/bin/ld.lld --compress-debug-sections=zlib @response.txt -o    24.55s user 3.43s system 253% cpu 11.036 total
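
The link times above can be reduced to the extra wall time each compression format adds over the uncompressed link. This is a minimal sketch using the `total` values from the `time` output; `LinkTimes`, `kLinkTimes` and `overhead` are illustrative names, not part of lld.

```cpp
#include <cassert>

// Wall-clock seconds copied from the `time` output, per --threads value.
struct LinkTimes {
  int threads;
  double none; // no compression
  double zstd; // --compress-debug-sections=zstd
  double zlib; // --compress-debug-sections=zlib
};

constexpr LinkTimes kLinkTimes[] = {
    {2, 8.477, 11.657, 15.805},
    {4, 6.640, 9.323, 11.036},
};

// Extra wall time attributable to compression.
inline double overhead(double baseline, double compressed) {
  assert(compressed >= baseline);
  return compressed - baseline;
}
```

At two threads this gives roughly 3.2 s of overhead for zstd against about 7.3 s for zlib; at four threads the figures are roughly 2.7 s and 4.4 s.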

Harbormaster completed remote builds in B186098: Diff 459391. Sep 11 2022, 5:19 PM

update

MaskRay published this revision for review. Sep 11 2022, 5:37 PM

MaskRay retitled this revision from [WIP][ELF] Parallelize --compress-debug-sections=zstd to [ELF] Parallelize --compress-debug-sections=zstd.

MaskRay added reviewers: ikudrin, andrewng, peter.smith.

Herald added a project: Restricted Project. Sep 11 2022, 5:37 PM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B186099: Diff 459392. Sep 11 2022, 5:47 PM

MaskRay edited the summary of this revision. Sep 11 2022, 9:39 PM

dblaikie added a subscriber: dblaikie. Sep 12 2022, 10:26 AM

Ping:)

I've left some comments based on a first read of the code. The code looks reasonable to me. I'm not in a position to say whether this performs badly on Windows (if there is even a ZSTD consumer on that platform), though.

lld/ELF/OutputSections.cpp
393	I think this is ZSTD's streaming compression? If I'm right could be worth saying something like: Use ZSTD's streaming compression API which permits parallel workers working on the stream. See http://facebook.github.io/zstd/zstd_manual.html Streaming compression -HowTo.
409	Could be worth making a const variable for 1 << 20 with a self descriptive name. Otherwise could be worth a comment on the choice of value.
413	more reads like it should be a boolean. Perhaps bytesRemaining?
421	Does ZSTD guarantee no error for the inputs that we are giving it? If we can't guarantee it then perhaps this should be a fatal error message.

address comments

lld/ELF/OutputSections.cpp
393	Thanks for the comment. Adopted.
409	Switched to `ZSTD_CStreamInSize()` to avoid using a magic number. It's a soft recommendation (https://github.com/llvm/llvm-project/issues/57685#issuecomment-1244950193), though.
421	From the source code it's guaranteed if the usage is correct. The author uses an assert, too: https://github.com/llvm/llvm-project/issues/57685#issuecomment-1244008295

Harbormaster completed remote builds in B187870: Diff 461746. Sep 20 2022, 4:03 PM

Thanks for the update. Code changes look good to me and are localised to ZSTD. Will be worth leaving a little bit of time for objections from other reviewers.

This revision is now accepted and ready to land. Sep 21 2022, 1:03 AM

LGTM and performance benefit is good on Windows too.

This revision was landed with ongoing or failed builds. Sep 21 2022, 11:13 AM

Closed by commit rGfa74144c64df: [ELF] Parallelize --compress-debug-sections=zstd (authored by MaskRay).

This revision was automatically updated to reflect the committed changes.

MaskRay added a commit: rGfa74144c64df: [ELF] Parallelize --compress-debug-sections=zstd.

This broke building with LLVM_LINK_LLVM_DYLIB; I went ahead and fixed it in 525a400c7ca5725b4ab456b222176f580caf35e7.

Revision Contents

Path: lld/ELF/OutputSections.cpp
Size: 62 lines

Diff 461957

lld/ELF/OutputSections.cpp

#include "llvm/Config/llvm-config.h" // LLVM_ENABLE_ZLIB		#include "llvm/Config/llvm-config.h" // LLVM_ENABLE_ZLIB
#include "llvm/Support/Compression.h"		#include "llvm/Support/Compression.h"
#include "llvm/Support/Parallel.h"		#include "llvm/Support/Parallel.h"
#include "llvm/Support/Path.h"		#include "llvm/Support/Path.h"
#include "llvm/Support/TimeProfiler.h"		#include "llvm/Support/TimeProfiler.h"
#if LLVM_ENABLE_ZLIB		#if LLVM_ENABLE_ZLIB
#include <zlib.h>		#include <zlib.h>
#endif		#endif
		#if LLVM_ENABLE_ZSTD
		#include <zstd.h>
		#endif

using namespace llvm;		using namespace llvm;
using namespace llvm::dwarf;		using namespace llvm::dwarf;
using namespace llvm::object;		using namespace llvm::object;
using namespace llvm::support::endian;		using namespace llvm::support::endian;
using namespace llvm::ELF;		using namespace llvm::ELF;
using namespace lld;		using namespace lld;
using namespace lld::elf;		using namespace lld::elf;
template <class ELFT> void OutputSection::maybeCompress() {
// Compress only DWARF debug sections.		// Compress only DWARF debug sections.
if (config->compressDebugSections == DebugCompressionType::None ||		if (config->compressDebugSections == DebugCompressionType::None ||
(flags & SHF_ALLOC) || !name.startswith(".debug_") || size == 0)		(flags & SHF_ALLOC) || !name.startswith(".debug_") || size == 0)
return;		return;

llvm::TimeTraceScope timeScope("Compress debug sections");		llvm::TimeTraceScope timeScope("Compress debug sections");
compressed.uncompressedSize = size;		compressed.uncompressedSize = size;
auto buf = std::make_unique<uint8_t[]>(size);		auto buf = std::make_unique<uint8_t[]>(size);
if (config->compressDebugSections == DebugCompressionType::Zstd) {		// Write uncompressed data to a temporary zero-initialized buffer.
{		{
parallel::TaskGroup tg;		parallel::TaskGroup tg;
writeTo<ELFT>(buf.get(), tg);		writeTo<ELFT>(buf.get(), tg);
}		}

		#if LLVM_ENABLE_ZSTD
		// Use ZSTD's streaming compression API which permits parallel workers working
		// on the stream. See http://facebook.github.io/zstd/zstd_manual.html
		// "Streaming compression - HowTo".
		if (config->compressDebugSections == DebugCompressionType::Zstd) {
		// Allocate a buffer of half of the input size, and grow it by 1.5x if
		// insufficient.
compressed.shards = std::make_unique<SmallVector<uint8_t, 0>[]>(1);		compressed.shards = std::make_unique<SmallVector<uint8_t, 0>[]>(1);
compression::zstd::compress(makeArrayRef(buf.get(), size),		SmallVector<uint8_t, 0> &out = compressed.shards[0];
compressed.shards[0]);		out.resize_for_overwrite(std::max<size_t>(size / 2, 32));
size = sizeof(Elf_Chdr) + compressed.shards[0].size();		size_t pos = 0;

		ZSTD_CCtx *cctx = ZSTD_createCCtx();
		size_t ret = ZSTD_CCtx_setParameter(
		cctx, ZSTD_c_nbWorkers, parallel::strategy.compute_thread_count());
		if (ZSTD_isError(ret))
		fatal(Twine("ZSTD_CCtx_setParameter: ") + ZSTD_getErrorName(ret));
		ZSTD_outBuffer zob = {out.data(), out.size(), 0};
		ZSTD_EndDirective directive = ZSTD_e_continue;
		const size_t blockSize = ZSTD_CStreamInSize();
		do {
		const size_t n = std::min(size - pos, blockSize);
		if (n == size - pos)
		directive = ZSTD_e_end;
		ZSTD_inBuffer zib = {buf.get() + pos, n, 0};
		size_t bytesRemaining = 0;
		while (zib.pos != zib.size ||
		(directive == ZSTD_e_end && bytesRemaining != 0)) {
		if (zob.pos == zob.size) {
		out.resize_for_overwrite(out.size() * 3 / 2);
		zob.dst = out.data();
		zob.size = out.size();
		}
		bytesRemaining = ZSTD_compressStream2(cctx, &zob, &zib, directive);
		assert(!ZSTD_isError(bytesRemaining));
		}
		pos += n;
		} while (directive != ZSTD_e_end);
		out.resize(zob.pos);
		ZSTD_freeCCtx(cctx);

		size = sizeof(Elf_Chdr) + out.size();
flags |= SHF_COMPRESSED;		flags |= SHF_COMPRESSED;
return;		return;
}		}
		#endif

#if LLVM_ENABLE_ZLIB		#if LLVM_ENABLE_ZLIB
// Write uncompressed data to a temporary zero-initialized buffer.
{
parallel::TaskGroup tg;
writeTo<ELFT>(buf.get(), tg);
}
// We chose 1 (Z_BEST_SPEED) as the default compression level because it is		// We chose 1 (Z_BEST_SPEED) as the default compression level because it is
// the fastest. If -O2 is given, we use level 6 to compress debug info more by		// the fastest. If -O2 is given, we use level 6 to compress debug info more by
// ~15%. We found that level 7 to 9 doesn't make much difference (~1% more		// ~15%. We found that level 7 to 9 doesn't make much difference (~1% more
// compression) while they take significant amount of time (~2x), so level 6		// compression) while they take significant amount of time (~2x), so level 6
// seems enough.		// seems enough.
const int level = config->optimize >= 2 ? 6 : Z_BEST_SPEED;		const int level = config->optimize >= 2 ? 6 : Z_BEST_SPEED;

// Split input into 1-MiB shards.		// Split input into 1-MiB shards.
constexpr size_t shardSize = 1 << 20;		constexpr size_t shardSize = 1 << 20;
auto shardsIn = split(makeArrayRef<uint8_t>(buf.get(), size), shardSize);		auto shardsIn = split(makeArrayRef<uint8_t>(buf.get(), size), shardSize);
const size_t numShards = shardsIn.size();		const size_t numShards = shardsIn.size();

// Compress shards and compute Alder-32 checksums. Use Z_SYNC_FLUSH for all		// Compress shards and compute Alder-32 checksums. Use Z_SYNC_FLUSH for all
// shards but the last to flush the output to a byte boundary to be		// shards but the last to flush the output to a byte boundary to be
// concatenated with the next shard.		// concatenated with the next shard.
auto shardsOut = std::make_unique<SmallVector<uint8_t, 0>[]>(numShards);		auto shardsOut = std::make_unique<SmallVector<uint8_t, 0>[]>(numShards);
auto shardsAdler = std::make_unique<uint32_t[]>(numShards);		auto shardsAdler = std::make_unique<uint32_t[]>(numShards);
parallelFor(0, numShards, [&](size_t i) {		parallelFor(0, numShards, [&](size_t i) {
shardsOut[i] = deflateShard(shardsIn[i], level,		shardsOut[i] = deflateShard(shardsIn[i], level,
i != numShards - 1 ? Z_SYNC_FLUSH : Z_FINISH);		i != numShards - 1 ? Z_SYNC_FLUSH : Z_FINISH);
shardsAdler[i] = adler32(1, shardsIn[i].data(), shardsIn[i].size());		shardsAdler[i] = adler32(1, shardsIn[i].data(), shardsIn[i].size());
});		});

// Update section size and combine Alder-32 checksums.		// Update section size and combine Alder-32 checksums.
uint32_t checksum = 1; // Initial Adler-32 value		uint32_t checksum = 1; // Initial Adler-32 value
size = sizeof(Elf_Chdr) + 2; // Elf_Chdir and zlib header		size = sizeof(Elf_Chdr) + 2; // Elf_Chdir and zlib header
for (size_t i = 0; i != numShards; ++i) {		for (size_t i = 0; i != numShards; ++i) {
size += shardsOut[i].size();		size += shardsOut[i].size();
checksum = adler32_combine(checksum, shardsAdler[i], shardsIn[i].size());		checksum = adler32_combine(checksum, shardsAdler[i], shardsIn[i].size());
}		}
size += 4; // checksum		size += 4; // checksum

compressed.shards = std::move(shardsOut);		compressed.shards = std::move(shardsOut);
compressed.numShards = numShards;		compressed.numShards = numShards;
compressed.checksum = checksum;		compressed.checksum = checksum;
flags |= SHF_COMPRESSED;		flags |= SHF_COMPRESSED;
#endif		#endif
}		}
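
The output-buffer policy in the zstd branch of the patch (start at half the input size, at least 32 bytes, and grow by 1.5x whenever the buffer fills up) can be illustrated in isolation. The sketch below only models the capacity arithmetic and does not call zstd; `initialCapacity` and `growthSteps` are hypothetical helpers, not functions from lld.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Initial output capacity as in the patch: half the input size, but never
// less than 32 bytes.
inline std::size_t initialCapacity(std::size_t inputSize) {
  return std::max<std::size_t>(inputSize / 2, 32);
}

// Number of 1.5x growth steps needed before the buffer can hold `needed`
// bytes of compressed output. Hypothetical model: the real loop grows on
// demand when the zstd output buffer is full rather than knowing `needed`
// up front.
inline unsigned growthSteps(std::size_t inputSize, std::size_t needed) {
  std::size_t cap = initialCapacity(inputSize);
  unsigned steps = 0;
  while (cap < needed) {
    cap = cap * 3 / 2; // same growth factor as out.size() * 3 / 2
    ++steps;
  }
  return steps;
}
```

Under this model, a section that compresses to at most half its size needs no reallocation, while poorly compressible data (for example 900 bytes of output from 1000 bytes of input) needs two growth steps.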
