This is an archive of the discontinued LLVM Phabricator instance.

Merge strings using sharded hash tables.
ClosedPublic

Authored by ruiu on Nov 27 2016, 1:12 PM.

Details

Summary

This is another attempt to speed up string merging. You may want to read
the description of https://reviews.llvm.org/D27146 first.

In this patch, I took a different approach than the probabilistic
algorithm used in D27146. Here is the algorithm.

The original code has a single hash table to merge strings. Now we
have N hash tables, where N is the parallelism (currently N=16).

We invoke N threads. Each thread knows its thread index I where
0 <= I < N. For each string S in a given string set, thread I adds S
to its own hash table only if hash(S) % N == I.

When all threads are done, there are N string tables with all
duplicated strings being merged.
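
Below is a minimal, self-contained sketch of the sharded scheme described
above, using standard containers rather than the patch's actual types;
NumShards and mergeShards are illustrative names, not LLD code.

  #include <cstddef>
  #include <functional>
  #include <string>
  #include <thread>
  #include <unordered_set>
  #include <vector>

  constexpr size_t NumShards = 16; // N: the parallelism

  std::vector<std::unordered_set<std::string>>
  mergeShards(const std::vector<std::string> &Strings) {
    std::vector<std::unordered_set<std::string>> Shards(NumShards);
    std::vector<std::thread> Threads;
    for (size_t Idx = 0; Idx < NumShards; ++Idx) {
      Threads.emplace_back([&, Idx] {
        // Each thread scans *all* strings but only inserts the ones whose
        // hash maps to its own shard, so no locking is needed.
        for (const std::string &S : Strings) {
          size_t Hash = std::hash<std::string>()(S);
          if (Hash % NumShards == Idx)
            Shards[Idx].insert(S); // duplicates collapse within the shard
        }
      });
    }
    for (std::thread &T : Threads)
      T.join();
    // Every duplicated string now appears exactly once, in exactly one shard.
    return Shards;
  }

Because a string's shard is determined solely by its hash, the resulting
layout does not depend on thread scheduling, which is what makes the output
deterministic.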

There are pros and cons of this algorithm compared to the
probabilistic one.

Pros:

  • It naturally produces deterministic output.
  • The output is guaranteed to be the smallest possible, since all duplicates are merged.

Cons:

  • Slower than the probabilistic algorithm due to the extra work it needs to do. All N threads independently visit all strings, and because the number of mergeable strings is so large, even just skipping the strings that belong to other shards is fairly expensive.

    On the other hand, the probabilistic algorithm doesn't need to skip any element.
  • Unlike the probabilistic algorithm, it degrades performance if the number of available CPU cores is smaller than N, because we now have more work to do in total than the original code.

    We could fix this if we had some way of knowing how many cores are idle.

Here are perf results. The probabilistic algorithm completed the same
task in 5.227 seconds, so this algorithm is slower than that.

Before:

   36095.759481 task-clock (msec)         #    5.539 CPUs utilized            ( +-  0.83% )
        191,033 context-switches          #    0.005 M/sec                    ( +-  0.22% )
          8,194 cpu-migrations            #    0.227 K/sec                    ( +- 12.24% )
      2,342,017 page-faults               #    0.065 M/sec                    ( +-  0.06% )
 99,758,779,851 cycles                    #    2.764 GHz                      ( +-  0.79% )
 80,526,137,412 stalled-cycles-frontend   #   80.72% frontend cycles idle     ( +-  0.95% )
<not supported> stalled-cycles-backend
 46,308,518,501 instructions              #    0.46  insns per cycle
                                          #    1.74  stalled cycles per insn  ( +-  0.12% )
  8,962,860,074 branches                  #  248.308 M/sec                    ( +-  0.17% )
    149,264,611 branch-misses             #    1.67% of all branches          ( +-  0.06% )

    6.517101649 seconds time elapsed                                          ( +-  0.42% )

After:

   45346.098328 task-clock (msec)         #    8.002 CPUs utilized            ( +-  0.77% )
        165,487 context-switches          #    0.004 M/sec                    ( +-  0.24% )
          7,455 cpu-migrations            #    0.164 K/sec                    ( +- 11.13% )
      2,347,870 page-faults               #    0.052 M/sec                    ( +-  0.84% )
125,725,992,168 cycles                    #    2.773 GHz                      ( +-  0.76% )
 96,550,047,016 stalled-cycles-frontend   #   76.79% frontend cycles idle     ( +-  0.89% )
<not supported> stalled-cycles-backend
 79,847,589,597 instructions              #    0.64  insns per cycle
                                          #    1.21  stalled cycles per insn  ( +-  0.22% )
 13,569,202,477 branches                  #  299.236 M/sec                    ( +-  0.28% )
    200,343,507 branch-misses             #    1.48% of all branches          ( +-  0.16% )

    5.666585908 seconds time elapsed                                          ( +-  0.67% )

To conclude, I lean towards the probabilistic algorithm if we can
make its output deterministic, since it's faster in any situation
(except for pathological inputs in which our assumption that most
duplicated strings are spread across inputs doesn't hold).

Event Timeline

ruiu updated this revision to Diff 79360.Nov 27 2016, 1:12 PM
ruiu updated this revision to Diff 79361.
ruiu retitled this revision from to Merge strings using sharded hash tables..
ruiu updated this object.
ruiu added a reviewer: silvas.
ruiu added a subscriber: llvm-commits.
  • Removed unused code.
silvas edited edge metadata.Nov 27 2016, 2:58 PM

This is the way I was going to suggest making the probabilistic approach deterministic.

Unlike the probabilistic algorithm, it degrades performance if the number of available CPU cores is smaller than N, because we now have more work to do in total than the original code.

Not exactly. You are assuming that there is no contention for the Offset.fetch_add(Size); keep in mind that that atomic add can actually be as expensive as one of the hash table lookups. If the small string map covers, say, 95% of the strings, that still means you are spending 1/20 of your time contending with other cores (these are just ballpark numbers). Using an approach like the one in this patch, where each thread can operate truly independently (well, except for false sharing as I noted above), is generally easier to reason about and can scale better.
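
Just to illustrate the contention being described (this is not the D27146
code itself; Offset and allocateSlot are placeholder names): every miss in
the per-thread map falls back to a read-modify-write on one shared cache
line.

  #include <atomic>
  #include <cstdint>

  std::atomic<uint64_t> Offset{0};

  // All threads contend on this single atomic; under load, the fetch_add
  // can cost roughly as much as a hash table lookup.
  uint64_t allocateSlot(uint64_t Size) {
    return Offset.fetch_add(Size, std::memory_order_relaxed);
  }
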

Also, the approach in this patch is an interesting contrast with the map lookups, since each map shard will generally be smaller, and so lookups in the hash table are faster. On the other hand, the cores aren't sharing the map in LLC, which the approach in D27146 does do.

Also, one thing that may be worth considering is reducing the overall number of times that we read all the strings from DRAM. Currently we do it a few times that I can think of:

  1. in MergeInputSection<ELFT>::splitStrings
  2. in the string deduplication
  3. when writing to the output

Ideally we could do some amount of this deduplication work (or all of it) in 1, while we still have the string in cache right after hashing it.

ELF/OutputSections.cpp
655

If two cores try to update nearby section pieces they will have false sharing. SectionPiece is only 16 bytes, so on x86-64 hosted LLD there are 4 of them per cache line; assuming the hashes are distributed well, it's not unlikely for threads to collide.

Even worse, there is negative feedback that prevents individual cores from running ahead, so they will have a hard time "spreading out" and will keep stepping on each other's toes.

  • Since they all visit the string tables in the same order, if one goes ahead, it will start to experience LLC cache misses when it goes to a new string table, which will slow it down causing the others to catch up with it.
  • When there are three threads T1, T2, T3 where T1 is ahead of T2 which is ahead of T3, the following will happen: T1 updates a section piece. This leaves a "landmine" for T2 to trip on (it needs to fetch and invalidate T1's cache line when it tries to write). When T2 trips on it, then T3 can catch up with T2, etc. So the net result is that the cores won't easily spread out.

I can think of two solutions:

  1. Have forEachPiece visit the input object files in a different order for each shard (this may hurt cache locality for the string values though); see the sketch after this list.
  2. Do a locality-improving sort (even just a couple levels of partitioning would be enough; it doesn't have to be perfectly sorted) of the string pieces. It doesn't even have to be sort-like really. It just needs to somehow permute the entries in a way that avoids false sharing.
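
A hypothetical sketch of option 1 (Inputs, processFile, and
forEachFileRotated are placeholders, not actual LLD symbols): each shard
starts from a different input file so that the threads are less likely to
write to neighboring SectionPieces at the same time.

  #include <cstddef>
  #include <vector>

  template <typename File, typename Fn>
  void forEachFileRotated(std::vector<File> &Inputs, size_t ShardIdx,
                          size_t NumShards, Fn processFile) {
    size_t N = Inputs.size();
    // Spread the starting points of the shards evenly across the inputs.
    size_t Start = N ? (ShardIdx * N) / NumShards : 0;
    for (size_t I = 0; I < N; ++I)
      processFile(Inputs[(Start + I) % N]);
  }
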
673

One interesting thing about this patch is that it shows that visiting all the pieces is not too expensive. This suggests that maybe doing some amount of preprocessing on the pieces in order to improve locality for the other parts could be worth it.

ruiu updated this revision to Diff 80381.Dec 5 2016, 11:21 PM
ruiu edited edge metadata.

I was about to commit D27155, but it turned out that the single-core
performance of that change is too poor to check in, so I want to submit
this patch instead.

This doesn't scale as much as D27155, but its single-core performance
doesn't suffer as much. Here is clang link time in seconds.
As you can see, its performance is competitive if you have at least
two cores. However, it doesn't scale beyond 8 cores. (I guess the
small improvements above 8 cores come from other parts of the linker.)

  # of cores    Before    After
   1            13.462    15.795   +17.33%
   2             9.766    10.046    +2.86%
   4             7.697     7.228    -6.09%
   8             6.888     5.672   -17.65%
  12             7.073     5.848   -17.31%
  16             7.066     5.746   -18.68%
  20             6.846     5.482   -19.92%

Linking clang-scale programs with only a single core is very painful
anyway, so I think this performance characteristic is okay.

ruiu added a comment.Dec 5 2016, 11:31 PM

LLD spawns the same number of threads as the number of hardware cores. I didn't adjust that part, but just bound processes to specific cores to run benchmarks. When I adjusted the thread count, the performance numbers got better. Here are the corrected numbers.

Single core: 14.76 seconds
Two cores: 9.121 seconds

silvas added a comment.Dec 6 2016, 9:47 PM

When running with just a single core, can we still do a sharded visitation, but instead of doing NumShards passes through all the pieces (skipping ones not in the current shard), only do a single pass? We should be able to get identical layout in that case.

I.e., instead of

        if ((Hash % NumShards) != Idx)
          continue;
...
        auto P = OffsetMap[Idx].insert({{S, Hash}, Off});

instead do:

auto P = OffsetMap[Hash % NumShards].insert({{S, Hash}, Off});

Maybe that can recover some of the single-threaded performance? If not, then we definitely need comments explaining what is causing the single-thread slowdown (which is equivalent to a throughput reduction; improving latency is good, but the throughput reduction seems to be about 20%+, which might be a problem, e.g. for a buildbot).
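
A self-contained sketch of this single-pass fallback, using standard
containers rather than the patch's actual types; addPieces, Pieces, and
OffsetMap are illustrative names. With one thread there is no data race, so
each piece can be routed straight to its shard in a single pass while
producing the same layout as the parallel case.

  #include <cstddef>
  #include <functional>
  #include <string>
  #include <unordered_map>
  #include <vector>

  constexpr size_t NumShards = 16;

  void addPieces(const std::vector<std::string> &Pieces, bool SingleThread,
                 size_t Idx, // this thread's shard index (parallel case only)
                 std::vector<std::unordered_map<std::string, size_t>> &OffsetMap) {
    for (const std::string &S : Pieces) {
      size_t Hash = std::hash<std::string>()(S);
      if (SingleThread) {
        // One pass: pick the shard from the hash; same shard assignment,
        // so the resulting layout is identical.
        OffsetMap[Hash % NumShards].insert({S, 0});
      } else {
        // Parallel case: skip pieces that belong to another thread's shard.
        if (Hash % NumShards != Idx)
          continue;
        OffsetMap[Idx].insert({S, 0});
      }
    }
  }
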

ELF/OutputSections.cpp
539–621

Do we already have a helper for this in libsupport?

663

Please pad this so that there isn't false sharing. DenseMap is smaller than a cacheline IIRC and so currently different threads will have false sharing.

ruiu updated this revision to Diff 80964.Dec 9 2016, 3:44 PM
  • updated as per review comments.
ruiu added a comment.Dec 9 2016, 3:44 PM

Resurrected the original code that uses the (non-concurrent) string table so it is used when -no-thread is given; that way this patch doesn't hurt when threads are explicitly disabled.

ELF/OutputSections.cpp
539–621

Done.

663

Done. It actually reduced the latency of this function by almost 10%. Wow.

silvas accepted this revision.Dec 9 2016, 6:03 PM
silvas edited edge metadata.

LGTM with nits.

Although I think this type of optimization makes sense to do, debug fission is really the best approach for reducing this cost, as Rafael said in the other thread. Make sure to check with Rafael that he doesn't feel strongly that we should avoid doing this right now.

My reasoning is that one of the main selling points for LLD is that it is fast. It should be fast and there should be no fine print (e.g. having to build LLD in an uncommon way like PGO or LTO, having to change your workflow to use fission, etc.). It's fine for us to have additional advice for users about how to make their links faster, but people are probably not going to be following that advice the first time they try LLD, and that's the time where they decide to keep using LLD or not.

ELF/OutputSections.cpp
567

I think you can use alignas to simplify this. (r287783 seems to be surviving in tree using alignas, so I think we can use it).

Also, I think something like alignas is actually needed to get the desired effect in all cases. Currently, OffsetMapTy probably has an actual alignment requirement of sizeof(void*), so an arrangement like:

First part of DenseMap1
------- cacheline boundary -------
Second part of DenseMap1
Padding
First part of DenseMap2
------- cacheline boundary -------
Second part of DenseMap2
....

is still possible, which will still have false sharing. Doubling the padding to 2x the cacheline size avoids this, but alignas is probably simpler.
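
A minimal sketch of the alignas suggestion, assuming a 64-byte cache line;
OffsetMapTy here stands in for whatever per-shard map type the patch uses.

  #include <cstdint>
  #include "llvm/ADT/DenseMap.h"

  using OffsetMapTy = llvm::DenseMap<uint64_t, uint64_t>;

  // alignas(64) both aligns each shard to a cache-line boundary and pads its
  // size up to a multiple of 64, so adjacent elements in an array of shards
  // never share a cache line, regardless of sizeof(OffsetMapTy).
  struct alignas(64) PaddedOffsetMap {
    OffsetMapTy Map;
  };
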

636

The environment variable is unused (and if you want to use it, the same comment as before about more precise naming applies).

This revision is now accepted and ready to land.Dec 9 2016, 6:03 PM
silvas closed this revision.Mar 25 2020, 6:33 PM

we landed a different version of this IIRC