This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
ELF/
-
InputSection.h
-
OutputSections.h
2
OutputSections.cpp

Differential D27146

Merge strings using a probabilistic algorithm to reduce latency.
AbandonedPublic

Authored by ruiu on Nov 26 2016, 9:11 PM.

Download Raw Diff

Details

Reviewers

silvas
• rafael

Summary

I'm sending this patch to get fedback. I haven't convince even myself
that this is the right thing to do. But this should be interesting
to those who want to see what we can do to improve linker's latency.

String merging is one of the slowest passes in LLD because of the
sheer number of mergeable strings. For example, Clang with debug info
contains 30 millions of mergeable strings (average length is about 50
bytes). They need to be uniquified, and uniquified strings need to
get consecutive offsets in the resulting string table.

Currently, we are using a (single-threaded, regular) dense map for
string unification. Merging the 30 million strings takes about 2
seconds on my machine.

This patch implements one of my ideas about how to reduce latency by
parallelizing it. This algorithm is probabilistic, meaining that
even though duplicated strings are likely to be merged, that's not
guaranteed. As a result, it produces larger string table quickly.
(If you need to optimize in size, you could still pass -O2 which
does tail-merging.)

Here's how it works.

In the first step, we take 10% of input string set to create a small
string table. The resulting string table is very unlikely to contain
all strings of the entire set, but it is likely to contain most of
duplicated strings, because duplicated strings are repeated many times.

The second step processes the remaining 90% in parallel. In this step,
we do not merge strings. So, if a string is not in the small string
table we created in the first step, that will just be appended to end
of the string table. This step completes the string table.

Here are some numbers of resulting clang executables:

Size of .debug_str section:
Current            108,049,822   (+0%)
Probabilistic      154,089,550   (+42.6%)
No string merging  1,591,388,940 (+1472.8%)

Size of resulting file:
Current            1,440,453,528 (+0%)
Probabilistic      1,490,597,448 (+3.5%)
No string merging  2,945,020,808 (+204.5%)

The probabilistic algorithm produces larger string table, but that's
much smaller than that without string merging. Compared to the entire
executable size, the loss is only 3.5%.

Here is a speedup in latency:

Before:

   36098.025468 task-clock (msec)         #    5.256 CPUs utilized            ( +-  0.95% )
        190,770 context-switches          #    0.005 M/sec                    ( +-  0.25% )
          7,609 cpu-migrations            #    0.211 K/sec                    ( +- 11.40% )
      2,378,416 page-faults               #    0.066 M/sec                    ( +-  0.07% )
 99,645,202,279 cycles                    #    2.760 GHz                      ( +-  0.94% )
 81,128,226,367 stalled-cycles-frontend   #   81.42% frontend cycles idle     ( +-  1.10% )
<not supported> stalled-cycles-backend
 45,662,681,567 instructions              #    0.46  insns per cycle
                                          #    1.78  stalled cycles per insn  ( +-  0.14% )
  8,864,616,311 branches                  #  245.571 M/sec                    ( +-  0.22% )
    146,360,227 branch-misses             #    1.65% of all branches          ( +-  0.06% )

    6.868559257 seconds time elapsed                                          ( +-  0.50% )

After:

   36905.733802 task-clock (msec)         #    7.061 CPUs utilized            ( +-  0.84% )
        159,813 context-switches          #    0.004 M/sec                    ( +-  0.24% )
          8,079 cpu-migrations            #    0.219 K/sec                    ( +- 12.67% )
      2,296,298 page-faults               #    0.062 M/sec                    ( +-  0.21% )
102,178,380,224 cycles                    #    2.769 GHz                      ( +-  0.83% )
 83,846,653,367 stalled-cycles-frontend   #   82.06% frontend cycles idle     ( +-  0.96% )
<not supported> stalled-cycles-backend
 46,138,345,206 instructions              #    0.45  insns per cycle
                                          #    1.82  stalled cycles per insn  ( +-  0.15% )
  8,824,763,690 branches                  #  239.116 M/sec                    ( +-  0.24% )
    142,482,338 branch-misses             #    1.61% of all branches          ( +-  0.05% )

    5.227024403 seconds time elapsed                                          ( +-  0.43% )

In terms of latency, this algorithm is a clear win.

With these results, I have a feeling that this algorithm could be
a reasonable addition to LLD. Only for a few percent of loss in size,
it reduces latency by about 25%, so it might be a good option for
daily edit-build-test cycles (on the other hand, disabling string
merging with -O0 creates 2x larger executables, which is sometimes
inconvenient even for daily development cycle.) You can still pass
-O2 to produce production binaries.

I have another idea to reduce string merging latency, so I'll
implement that later for comparison.

Diff Detail

Build Status

Buildable 1614
Build 1614: arc lint + arc unit

Event Timeline

ruiu updated this revision to Diff 79345.Nov 26 2016, 9:11 PM

ruiu retitled this revision from to Merge strings using a probabilistic algorithm to reduce latency..

ruiu updated this object.

ruiu added reviewers: • rafael, silvas.

ruiu added a subscriber: llvm-commits.

This is a really neat idea! Great observation that the string deduplication doesn't have to be completely perfect.

One question: how much possible speedup is there from optimizing this code path, and how much of that does the current patch get?

My biggest concern with this approach is how to get deterministic output from this approach.

As it stands, the current patch is very similar to having StringTableBuilder only use StringIndexMap.insert some fraction of the time, and the rest of the time just using find and appending the string to the table if it isn't found.

ELF/OutputSections.cpp
633	This will result in non-deterministic output :/

One other question: how much of this speedup is due to the parallelizing, and how much due to the improved algorithm?

Hi Rui,

Why can't you use two-bit counting Bloom filter to efficiently calculate whether the given string would need to be merged or not? Bloom filter filling could be done in parallel and on the second pass you'd insert the strings in the map only if they are known to have count > 1.

If you need the code for counting BF that could be filled in in parallel - let me know.

One other approach (which might even complement this) is that we might be able to come up with a fast heuristic for determining if a string is likely unique or likely not-unique. I assume there is some sort of pattern; e.g. probably file paths are duplicated a lot in debug info. We could calculate these heuristics nearly for free at the same time as we hash the strings.

With such a heuristic, we can bypass the hash table lookup altogether.

I looked more closely at the data (LLD self-link, which includes LLVM for LTO so is reasonably large). I don't think that there will be an easy static heuristic. The bulk are symbol names and the duplication factor is just a complicated function of how the input program is structured (e.g. if a particular class is a common utility used in many places, etc.).

Also, one interesting thing is that the total size is not dominated by strings with any particular number of duplicates. It is somewhat randomly spread out across various discrete numbers of duplicates.

Here is a plot. The horizontal axis is the number of duplicates. The vertical axis is the number of bytes represented by strings with that number of duplicates. Area under the curve is proportional to total string size for the non-deduplicated output.
https://reviews.llvm.org/F2622328

Another useful thing to look at is that same plot, but "integrated": https://reviews.llvm.org/F2622335

This makes it easier to see why the approach in this patch works well. About 80% of string mass is present in strings which are duplicated more than 50 times.
There are approximately 750 files in the link. So assuming random sampling, if we look at 10% of them (so, 75), then of the strings with more than 50 repetitions, we will miss approximately (700 / 750) * (699 / 749) * (698 / 748) * ... * (625 / 675) ~= 2% of them; so this approach will save almost all of that 80% string mass.
(this is just to get intuition; of course, some strings that are replicated less than 50 times will be deduplicated too, but that will only increase the savings).

Sean, effectively counting bloom filter would give almost exact answer whether the string is a singleton or duplicated. This way no hash table lookups / insertions are even necessary.

String merging shrinks the size of the string table from 1591 MB to 108 MB, so we already know that a string is a duplicate with 93% probability. Bloom filter could improve it to, say, 99%, but will it make a difference?

Sean, thank you for the investigation and the numbers! I knew some of the numbers (that's why I could come up with this), but I just implemented, set the threshold to 10%, and it just worked, so I didn't take a close look at each section contents.

One thing I'd like to note in addition to your results is that strings are already uniquified for each input section (for obvious reason... compilers don't emit redundant strings.) So the likelihood of identifying duplicate strings when picking up random samples from input sections is larger than picking up random samples from 30 million strings. Instead of a single bag containing 30 million strings, we have approximately 1000 bags containing 30 million strings in total, and inside each bag all strings are unique. That's why I construct the small string table from random samples of input sections instead of random samples of section pieces.

One other question: how much of this speedup is due to the parallelizing, and how much due to the improved algorithm?

The speedup is entirely due to the parallelizing. If you run this with a single thread, you'd get the same latency as before. And that's a good property because this algorithm doesn't slow down the linker when available CPU resource is scarce.

ELF/OutputSections.cpp
633	Hmm, that's true, and I have no good idea to make it deterministic at the moment. We can partly serialize it, but it would slow down this part of process.

ruiu mentioned this in D27152: Merge strings using sharded hash tables..Nov 27 2016, 1:12 PM

In D27146#606203, @ruiu wrote:

Sean, thank you for the investigation and the numbers! I knew some of the numbers (that's why I could come up with this), but I just implemented, set the threshold to 10%, and it just worked, so I didn't take a close look at each section contents.

One thing I'd like to note in addition to your results is that strings are already uniquified for each input section (for obvious reason... compilers don't emit redundant strings.) So the likelihood of identifying duplicate strings when picking up random samples from input sections is larger than picking up random samples from 30 million strings. Instead of a single bag containing 30 million strings, we have approximately 1000 bags containing 30 million strings in total, and inside each bag all strings are unique. That's why I construct the small string table from random samples of input sections instead of random samples of section pieces.

Nice observation!

One other question: how much of this speedup is due to the parallelizing, and how much due to the improved algorithm?

The speedup is entirely due to the parallelizing. If you run this with a single thread, you'd get the same latency as before. And that's a good property because this algorithm doesn't slow down the linker when available CPU resource is scarce.

An approach like this could improve overall CPU time by reducing the amount of time in hash table lookups. Smaller hash tables are faster to look up, and so we can bail out faster for strings without duplicates.
Also, the observation that false negatives don't hurt can be used to speed up the hash table (but that would be quite a bit of work). I think you're observation that the string merging for -O1 can accept false negatives for detecting duplicates is the key idea. There are lots of different optimizations to be done based on this.

I was going to suggest using modified bloom filters that detect duplicates, but Anton beat me to it! (I didn't know such a thing already had a name "counting bloom filters").
If we already have hash values for each string, then doing the bloom filter lookup is pretty cheap, and it is easy to combine
This talk is a good example of bloom filters in a similar problem (except that "input object files" are "documents" and "strings" are "terms"): https://www.youtube.com/watch?v=80LKF2qph6I
The good thing is that the final bloom filter can be constructed hierarchically in parallel. I'm not sure how useful they would be for this application though since for string deduplication "most" strings are duplicates, and so we need to fetch the offset anyway, so there is not much advantage to try to filter out the cases where we *don't* need to do the offset lookup.

Rafael,

Sorry for not marking this as "abandoned". I wrote three patches to approach this problem from different ways, and https://reviews.llvm.org/D27152 is the one that I want to check in. Can you take a look if you are interested?

Revision Contents

Path

Size

ELF/

InputSection.h

8 lines

OutputSections.h

6 lines

OutputSections.cpp

119 lines

Diff 79345

ELF/InputSection.h

	Show First 20 Lines • Show All 154 Lines • ▼ Show 20 Lines
	};			};

	// SectionPiece represents a piece of splittable section contents.			// SectionPiece represents a piece of splittable section contents.
	// We allocate a lot of these and binary search on them. This means that they			// We allocate a lot of these and binary search on them. This means that they
	// have to be as compact as possible, which is why we don't store the size (can			// have to be as compact as possible, which is why we don't store the size (can
	// be found by looking at the next one) and put the hash in a side table.			// be found by looking at the next one) and put the hash in a side table.
	struct SectionPiece {			struct SectionPiece {
	SectionPiece(size_t Off, bool Live = false)			SectionPiece(size_t Off, bool Live = false)
	: InputOff(Off), OutputOff(-1), Live(Live \|\| !Config->GcSections) {}			: InputOff(Off), Live(Live \|\| !Config->GcSections), OutputOff(-1),
				First(false) {}

	size_t InputOff;			size_t InputOff : 8 * sizeof(size_t) - 1;
	ssize_t OutputOff : 8 * sizeof(ssize_t) - 1;
	size_t Live : 1;			size_t Live : 1;
				ssize_t OutputOff : 8 * sizeof(ssize_t) - 1;
				size_t First : 1;
	};			};
	static_assert(sizeof(SectionPiece) == 2 * sizeof(size_t),			static_assert(sizeof(SectionPiece) == 2 * sizeof(size_t),
	"SectionPiece is too big");			"SectionPiece is too big");

	// This corresponds to a SHF_MERGE section of an input file.			// This corresponds to a SHF_MERGE section of an input file.
	template <class ELFT> class MergeInputSection : public InputSectionBase<ELFT> {			template <class ELFT> class MergeInputSection : public InputSectionBase<ELFT> {
	typedef typename ELFT::uint uintX_t;			typedef typename ELFT::uint uintX_t;
	typedef typename ELFT::Sym Elf_Sym;			typedef typename ELFT::Sym Elf_Sym;
	▲ Show 20 Lines • Show All 136 Lines • Show Last 20 Lines

ELF/OutputSections.h

Show All 10 Lines
#define LLD_ELF_OUTPUT_SECTIONS_H		#define LLD_ELF_OUTPUT_SECTIONS_H

#include "Config.h"		#include "Config.h"
#include "Relocations.h"		#include "Relocations.h"

#include "lld/Core/LLVM.h"		#include "lld/Core/LLVM.h"
#include "llvm/MC/StringTableBuilder.h"		#include "llvm/MC/StringTableBuilder.h"
#include "llvm/Object/ELF.h"		#include "llvm/Object/ELF.h"
		#include <functional>

namespace lld {		namespace lld {
namespace elf {		namespace elf {

class SymbolBody;		class SymbolBody;
struct EhSectionPiece;		struct EhSectionPiece;
		struct SectionPiece;
template <class ELFT> class EhInputSection;		template <class ELFT> class EhInputSection;
template <class ELFT> class InputSection;		template <class ELFT> class InputSection;
template <class ELFT> class InputSectionBase;		template <class ELFT> class InputSectionBase;
template <class ELFT> class MergeInputSection;		template <class ELFT> class MergeInputSection;
template <class ELFT> class OutputSection;		template <class ELFT> class OutputSection;
template <class ELFT> class ObjectFile;		template <class ELFT> class ObjectFile;
template <class ELFT> class SharedFile;		template <class ELFT> class SharedFile;
template <class ELFT> class SharedSymbol;		template <class ELFT> class SharedSymbol;
▲ Show 20 Lines • Show All 104 Lines • ▼ Show 20 Lines	public:
Kind getKind() const override { return Merge; }		Kind getKind() const override { return Merge; }
static bool classof(const OutputSectionBase *B) {		static bool classof(const OutputSectionBase *B) {
return B->getKind() == Merge;		return B->getKind() == Merge;
}		}

private:		private:
void finalizeTailMerge();		void finalizeTailMerge();
void finalizeNoTailMerge();		void finalizeNoTailMerge();
		void forEachPiece(
		ArrayRef<MergeInputSection<ELFT> *> Sections,
		std::function<void(SectionPiece &Piece, llvm::CachedHashStringRef S)> Fn);

llvm::StringTableBuilder Builder;		llvm::StringTableBuilder Builder;
std::vector<MergeInputSection<ELFT> *> Sections;		std::vector<MergeInputSection<ELFT> *> Sections;
		size_t StringAlignment;
};		};

struct CieRecord {		struct CieRecord {
EhSectionPiece *Piece = nullptr;		EhSectionPiece *Piece = nullptr;
std::vector<EhSectionPiece *> FdePieces;		std::vector<EhSectionPiece *> FdePieces;
};		};

// Output section for .eh_frame.		// Output section for .eh_frame.
▲ Show 20 Lines • Show All 120 Lines • Show Last 20 Lines

ELF/OutputSections.cpp

Show All 15 Lines
#include "SymbolTable.h"		#include "SymbolTable.h"
#include "SyntheticSections.h"		#include "SyntheticSections.h"
#include "Target.h"		#include "Target.h"
#include "lld/Core/Parallel.h"		#include "lld/Core/Parallel.h"
#include "llvm/Support/Dwarf.h"		#include "llvm/Support/Dwarf.h"
#include "llvm/Support/MD5.h"		#include "llvm/Support/MD5.h"
#include "llvm/Support/MathExtras.h"		#include "llvm/Support/MathExtras.h"
#include "llvm/Support/SHA1.h"		#include "llvm/Support/SHA1.h"
		#include <atomic>

using namespace llvm;		using namespace llvm;
using namespace llvm::dwarf;		using namespace llvm::dwarf;
using namespace llvm::object;		using namespace llvm::object;
using namespace llvm::support::endian;		using namespace llvm::support::endian;
using namespace llvm::ELF;		using namespace llvm::ELF;

using namespace lld;		using namespace lld;
▲ Show 20 Lines • Show All 433 Lines • ▼ Show 20 Lines	if (In<ELFT>::EhFrameHdr) {
}		}
}		}
}		}

template <class ELFT>		template <class ELFT>
MergeOutputSection<ELFT>::MergeOutputSection(StringRef Name, uint32_t Type,		MergeOutputSection<ELFT>::MergeOutputSection(StringRef Name, uint32_t Type,
uintX_t Flags, uintX_t Alignment)		uintX_t Flags, uintX_t Alignment)
: OutputSectionBase(Name, Type, Flags),		: OutputSectionBase(Name, Type, Flags),
Builder(StringTableBuilder::RAW, Alignment) {}		Builder(StringTableBuilder::RAW, Alignment), StringAlignment(Alignment) {
		assert(Alignment != 0 && isPowerOf2_64(Alignment));
		}

template <class ELFT> void MergeOutputSection<ELFT>::writeTo(uint8_t *Buf) {		template <class ELFT> void MergeOutputSection<ELFT>::writeTo(uint8_t *Buf) {
		if (shouldTailMerge()) {
Builder.write(Buf);		Builder.write(Buf);
		return;
		}

		// Builder is not used for probabilistic string merging.
		for (MergeInputSection<ELFT> *Sec : Sections) {
		for (size_t I = 0, E = Sec->Pieces.size(); I != E; ++I) {
		SectionPiece &Piece = Sec->Pieces[I];
		if (!Piece.Live \|\| !Piece.First)
		continue;
		StringRef S = Sec->getData(I).val();
		memcpy(Buf + Piece.OutputOff, S.data(), S.size());
		}
		}
}		}

template <class ELFT>		template <class ELFT>
void MergeOutputSection<ELFT>::addSection(InputSectionData *C) {		void MergeOutputSection<ELFT>::addSection(InputSectionData *C) {
auto *Sec = cast<MergeInputSection<ELFT>>(C);		auto *Sec = cast<MergeInputSection<ELFT>>(C);
Sec->OutSec = this;		Sec->OutSec = this;
this->updateAlignment(Sec->Alignment);		this->updateAlignment(Sec->Alignment);
this->Entsize = Sec->Entsize;		this->Entsize = Sec->Entsize;
Show All 34 Lines	for (MergeInputSection<ELFT> *Sec : Sections)
for (size_t I = 0, E = Sec->Pieces.size(); I != E; ++I)		for (size_t I = 0, E = Sec->Pieces.size(); I != E; ++I)
if (Sec->Pieces[I].Live)		if (Sec->Pieces[I].Live)
Sec->Pieces[I].OutputOff = Builder.add(Sec->getData(I));		Sec->Pieces[I].OutputOff = Builder.add(Sec->getData(I));

Builder.finalizeInOrder();		Builder.finalizeInOrder();
this->Size = Builder.getSize();		this->Size = Builder.getSize();
}		}

		static size_t align2(size_t Val, size_t Alignment) {
		return (Val + Alignment - 1) & ~(Alignment - 1);
		}

		// Call Fn for each section piece.
		template <class ELFT>
		void MergeOutputSection<ELFT>::forEachPiece(
		ArrayRef<MergeInputSection<ELFT> *> Sections,
		std::function<void(SectionPiece &Piece, CachedHashStringRef S)> Fn) {
		for (MergeInputSection<ELFT> *Sec : Sections)
		for (size_t I = 0, E = Sec->Pieces.size(); I != E; ++I)
		if (Sec->Pieces[I].Live)
		Fn(Sec->Pieces[I], Sec->getData(I));
		}

		// Split a vector Vec into smaller vectors.
		template <class T>
		static std::vector<std::vector<T>> split(std::vector<T> Vec, size_t NumShards) {
		std::vector<std::vector<T>> Ret(NumShards);
		size_t I = 0;
		for (T &Elem : Vec)
		Ret[I++ % NumShards].push_back(Elem);
		return Ret;
		}

template <class ELFT> void MergeOutputSection<ELFT>::finalize() {		template <class ELFT> void MergeOutputSection<ELFT>::finalize() {
if (shouldTailMerge())		if (shouldTailMerge()) {
finalizeTailMerge();		finalizeTailMerge();
else		return;
finalizeNoTailMerge();		}

		// This implements a probabilistic string merging algorithm.
		//
		// In this function, we merge identical strings and assign contiguous
		// offsets to unique strings to create a string table. This is one of the
		// most time-consuming passes in LLD because of the sheer number of strings.
		// When we link large programs such as Clang with debug info, we need to
		// merge thousands of sections containing millions of string pieces.
		// On my Xeon 2.8 GHz machine, merging 30 million strings (average length
		// is about 50 bytes) using a single-threaded hash table takes about 2
		// seconds.
		//
		// The probabilistic algorithm improves the latency to 300 milliseconds by
		// allowing a loss in output size. In other words, this algorithm faster
		// but produces larger string table. If you need to optimize in size,
		// you should pass -O2 to LLD.
		//
		// Here's how it works. In the first step, we take 10% of input string set
		// to create a small string table. The resulting string table is very
		// unlikely to contain all strings of the entire set, but it is likely to
		// contain most of duplicated strings, because duplicated strings are
		// repeated many times.
		//
		// The second step processes the remaining 90% in parallel. In this step,
		// we do not merge strings. So, if a string is not in the small string
		// table we created in the first step, that will be just appended to end
		// of the string table. This step completes the string table.

		DenseMap<CachedHashStringRef, size_t> OffsetMap;
		std::atomic<size_t> Offset;
		Offset.store(0);

		size_t NumShards = 100;
		std::vector<std::vector<MergeInputSection<ELFT> *>> Shards =
		split(Sections, NumShards);

		// Step 1: construct a small string table
		for (size_t I = 0; I < NumShards / 10; ++I) {
		forEachPiece(Shards[I], [&](SectionPiece &Piece, CachedHashStringRef S) {
		auto P = OffsetMap.insert({S, Offset.load()});
		if (P.second) {
		Piece.First = true;
		Piece.OutputOff = Offset.load();
		Offset.store(align2(Offset.load() + S.size(), StringAlignment));
		} else {
		Piece.OutputOff = P.first->second;
		}
		});
		}

		// Step 2: append remaining strings
		parallel_for_each(
		Shards.begin() + NumShards / 10, Shards.end(),
		[&](ArrayRef<MergeInputSection<ELFT> *> Sections) {
		forEachPiece(Sections, [&](SectionPiece &Piece, CachedHashStringRef S) {
		auto It = OffsetMap.find(S);
		if (It == OffsetMap.end()) {
		Piece.First = true;
		size_t Size = align2(S.size(), StringAlignment);
		Piece.OutputOff = Offset.fetch_add(Size);
		silvasUnsubmitted Not Done Reply Inline Actions This will result in non-deterministic output :/ silvas: This will result in non-deterministic output :/
		ruiuAuthorUnsubmitted Not Done Reply Inline Actions Hmm, that's true, and I have no good idea to make it deterministic at the moment. We can partly serialize it, but it would slow down this part of process. ruiu: Hmm, that's true, and I have no good idea to make it deterministic at the moment. We can partly…
		} else {
		Piece.OutputOff = It->second;
		}
		});
		});

		this->Size = Offset.load();
}		}

template <class ELFT>		template <class ELFT>
static typename ELFT::uint getOutFlags(InputSectionBase<ELFT> *S) {		static typename ELFT::uint getOutFlags(InputSectionBase<ELFT> *S) {
return S->Flags & ~SHF_GROUP & ~SHF_COMPRESSED;		return S->Flags & ~SHF_GROUP & ~SHF_COMPRESSED;
}		}

template <class ELFT>		template <class ELFT>
▲ Show 20 Lines • Show All 114 Lines • Show Last 20 Lines