This is an archive of the discontinued LLVM Phabricator instance.

Merge strings using concurrent hash map (3rd try!)
ClosedPublic

Authored by ruiu on Nov 27 2016, 4:32 PM.

Details

Summary

Here is yet another implementation of the string merging algorithm,
and this one is faster than the previous two (https://reviews.llvm.org/D27146,
https://reviews.llvm.org/D27152).

ParallelStringTableBuilder, implemented in this patch, is a concurrent
hash table specialized for string table creation. It doesn't support
resizing, and you cannot do anything with it other than insert strings
and write them out to a buffer. By limiting the use case this way, a
concurrent hash table can be implemented fairly easily. (In general it
is extremely hard.)
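
As a rough illustration of the insert path of such a table, here is a
minimal sketch; it is not the patch's code, and every name in it (Entry,
FixedConcurrentTable) is hypothetical. Each thread publishes a fully
constructed entry with a single compare-and-swap, and collisions fall
through to the next bucket (linear probing):

```cpp
#include <atomic>
#include <cstdint>
#include <memory>
#include <string_view>
#include <vector>

// Each bucket is an atomic pointer to an immutable Entry. Publishing the
// whole Entry through one CAS avoids races on partially written fields.
struct Entry {
  std::string_view Str;
  uint32_t Hash;
};

class FixedConcurrentTable {
public:
  explicit FixedConcurrentTable(size_t NumBuckets) : Buckets(NumBuckets) {}

  // Returns true if S was newly inserted, false if it was already present.
  // The table never resizes; the caller must size it generously up front.
  bool insert(std::string_view S, uint32_t Hash) {
    auto New = std::make_unique<Entry>(Entry{S, Hash});
    size_t Idx = Hash % Buckets.size();
    for (;;) {
      Entry *Existing = Buckets[Idx].load(std::memory_order_acquire);
      if (!Existing &&
          Buckets[Idx].compare_exchange_strong(Existing, New.get(),
                                               std::memory_order_release,
                                               std::memory_order_acquire)) {
        New.release(); // ownership transferred to the table
        return true;
      }
      if (Existing && Existing->Hash == Hash && Existing->Str == S)
        return false; // another thread already inserted this string
      Idx = (Idx + 1) % Buckets.size(); // linear probing: try the next bucket
    }
  }

private:
  std::vector<std::atomic<Entry *>> Buckets; // fixed size, never rehashed
};
```

Publishing through a single pointer CAS also sidesteps the
partially-written-fields race that comes up later in this review.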

This algorithm creates optimized string table in terms of size, and
the output is deterministic.

The internal hash table is an open-addressing hash table, so a conflict
is resolved by moving to the next empty bucket. That brings in
nondeterminism: if two threads try to claim the same bucket, only one
succeeds, and the other gets the next empty one, so the bucket order
depends on thread timing.

To fix the problem, we sort buckets after inserting all keys. We don't
need to sort the entire hash table as one unit; instead, we sort the
buckets within each streak (consecutive run) of claimed buckets.
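
A minimal sketch of that per-streak sort, reusing the hypothetical
layout from the table above (after insertion finishes, the buckets can
be read single-threaded as plain pointers; empty buckets are null):

```cpp
#include <algorithm>
#include <vector>

// Sort each maximal run ("streak") of occupied buckets independently.
// Linear probing guarantees that the set of buckets in each streak is the
// same for every insertion order, so this yields a deterministic layout.
void sortStreaks(std::vector<Entry *> &Buckets) {
  auto Less = [](const Entry *A, const Entry *B) { return A->Str < B->Str; };
  size_t I = 0;
  while (I < Buckets.size()) {
    if (!Buckets[I]) {
      ++I; // skip empty buckets
      continue;
    }
    size_t Begin = I;
    while (I < Buckets.size() && Buckets[I])
      ++I; // advance to the end of this streak
    std::sort(Buckets.begin() + Begin, Buckets.begin() + I, Less);
  }
}
```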

Here are the performance numbers. This is better than the probabilistic
algorithm (5.227 seconds) and the sharded hash table algorithm (5.666
seconds).

Before:

   36427.671361 task-clock (msec)         #    5.477 CPUs utilized            ( +-  1.34% )
        158,095 context-switches          #    0.004 M/sec                    ( +-  0.27% )
          6,165 cpu-migrations            #    0.169 K/sec                    ( +- 21.57% )
      2,365,415 page-faults               #    0.065 M/sec                    ( +-  0.18% )
100,831,590,020 cycles                    #    2.768 GHz                      ( +-  1.32% )
 81,880,778,356 stalled-cycles-frontend   #   81.21% frontend cycles idle     ( +-  1.55% )
<not supported> stalled-cycles-backend
 45,993,420,294 instructions              #    0.46  insns per cycle
                                          #    1.78  stalled cycles per insn  ( +-  0.17% )
  8,913,176,489 branches                  #  244.681 M/sec                    ( +-  0.28% )
    148,952,459 branch-misses             #    1.67% of all branches          ( +-  0.10% )

    6.651371241 seconds time elapsed                                          ( +-  0.80% )

After:

   46385.337835 task-clock (msec)         #    8.869 CPUs utilized            ( +-  1.14% )
        170,016 context-switches          #    0.004 M/sec                    ( +-  0.39% )
          7,903 cpu-migrations            #    0.170 K/sec                    ( +- 19.36% )
      2,302,650 page-faults               #    0.050 M/sec                    ( +-  0.08% )
128,744,691,817 cycles                    #    2.776 GHz                      ( +-  1.13% )
109,140,318,510 stalled-cycles-frontend   #   84.77% frontend cycles idle     ( +-  1.23% )
<not supported> stalled-cycles-backend
 46,600,275,432 instructions              #    0.36  insns per cycle
                                          #    2.34  stalled cycles per insn  ( +-  0.65% )
  8,953,846,757 branches                  #  193.032 M/sec                    ( +-  1.04% )
    150,976,047 branch-misses             #    1.69% of all branches          ( +-  0.19% )

    5.230174248 seconds time elapsed                                          ( +-  0.69% )

Event Timeline

ruiu updated this revision to Diff 79364.Nov 27 2016, 4:32 PM
ruiu retitled this revision from to Merge strings using concurrent hash map (3rd try!).
ruiu updated this object.
ruiu added a reviewer: silvas.
ruiu added a subscriber: llvm-commits.
ruiu updated this revision to Diff 79367.Nov 27 2016, 8:36 PM

After a nonstop whole day of hacking, I somehow managed to get deterministic
output from the concurrent hash table.

The new number is 5.230 seconds. This is slightly slower than the previous
nondeterministic implementation (5.136 seconds), but that is not bad at all.
That is about the same as the probabilistic algorithm (5.227 seconds) and
faster than the sharded hash table algorithm (5.666 seconds).

Taking into account the fact that this produces deterministic, minimal
output, this is definitely the best algorithm so far.

The downside is that the implementation is now a bit tricky. I don't
think it is hard to read, but it's undeniably more complicated than the
original single-threaded implementation. I believe that's acceptable in
this case though.

ruiu updated this object.Nov 27 2016, 8:39 PM
silvas edited edge metadata.Nov 27 2016, 9:28 PM

Wow, great work. I think I've convinced myself that this will be deterministic. The crucial observation is that with linear probing, the "streaks" (runs of consecutive occupied buckets) are always the same regardless of insertion order.

How much performance do we gain by sorting the streaks individually as this patch does? We could use a single call to std::remove_if to coalesce all the non-empty buckets and a single call to std::sort to get a deterministic order. If that doesn't cost too much performance, it would be the simpler approach.
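
A sketch of that simpler alternative, under the same hypothetical bucket
layout as the sketches above (a fragment, not the patch's code):

```cpp
// Coalesce all occupied buckets to the front, then sort them in one call.
auto End = std::remove_if(Buckets.begin(), Buckets.end(),
                          [](const Entry *E) { return E == nullptr; });
std::sort(Buckets.begin(), End,
          [](const Entry *A, const Entry *B) { return A->Str < B->Str; });
```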

ELF/OutputSections.cpp
475

This is extremely troubling as it implies that we need to *guarantee* that we have chosen the size large enough or else LLD will have undefined behavior.

494

This property depends critically on the linear probing. This would not hold if we used quadratic probing. It would be good to mention that.

533

Won't this end up smaller for 32-bit hosts?

579

This must be unreachable, or it must somehow signal this so that the calling code can retry the string table building with a bigger size. We can't have LLD fail to link because a user's object files don't have enough duplicate strings.

ruiu added a comment.Nov 27 2016, 9:35 PM

I will address review comments tomorrow, but here is a breakdown. When merging 29,313,742 strings, we spent

  • 185,729 us to insert them into the concurrent hash table,
  • 70,814 us to sort runs of claimed buckets,
  • 66,560 us to assign string table offsets to buckets, and
  • 124,903 us to update the OutputOff member of all SectionPieces.

I think the algorithm is correct, but this patch is not ready for commit yet. As you said, we need to handle the case where the hash table becomes full. (My rough idea: when it becomes full, discard the entire hash table and redo from scratch with a larger one. Resizing an in-use concurrent hash table is extremely hard.)

ruiu updated this revision to Diff 79414.Nov 28 2016, 9:16 AM
ruiu updated this object.
ruiu edited edge metadata.
  • Move the class to Concurrent.{h,cpp}
  • Handle the case when the table becomes full
ruiu added a comment.Nov 28 2016, 9:19 AM

As to whether we should call std::remove_if and then std::sort only once, I don't think we should. Sorting is O(n log n), so you don't want to make n larger by merging streaks that can be sorted independently.
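
For a rough sense of the gap, with illustrative numbers: sorting
n = 1,048,576 occupied buckets in one global call does about
n * log2(n) = 20n comparisons, while sorting them as streaks of average
length 16 does about n * log2(16) = 4n, five times fewer (assuming the
streaks stay short, which a low load factor encourages).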

I don't think that it makes sense to put this in a file with a generic name like "Concurrent", as this is a quite specialized data structure that depends on specific LLD types. Maybe just call it ConcurrentStringTableBuilder.h?
I'm a bit worried about the layering too; does this class introduce any circular dependencies?

ELF/Concurrent.cpp
24 ↗(On Diff #79414)

Why double it?

Also, you have a similar std::max calculation in MergeOutputSection<ELFT>::finalize. Do you need it in both places?

26 ↗(On Diff #79414)

Can you use a std::unique_ptr<EntryTy[]> to manage this?

73 ↗(On Diff #79414)

I think this technically needs to be atomic to avoid races.

ELF/OutputSections.cpp
772

I'm very concerned that there may be user scenarios where we almost always need multiple trips through this loop (e.g., maybe "most" non-debug builds end up with only 2 duplicates on average). How will we determine whether that is the case? Otherwise this "optimization" may backfire and users in the wild will get slower links; we need some feedback loop to correct for that, or to be really confident that this optimization always speeds things up.

I would propose that we do a run of Poudriere or Debian or Gentoo with this patch applied and treat the resize case as a fatal error (and ideally the error message would also give the number of resizes). That way we can get an idea of how common this is.

Sorry for the delay.

ELF/Concurrent.cpp
53 ↗(On Diff #79414)

I assume the idea here is to identify as early as possible whether we underestimated the number of buckets. I like this idea. Do you have any justification for the choice of numbers? Why is this better than just waiting for the table to become full? How much better? Should the check be 50%? (Knuth Vol 3 Sec 6.4 provides a good analysis for linearly probed open-addressed hash tables; the lookup costs will skyrocket after 75% load; is that your motivation? It would be good to have a comment)
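
For reference, the standard linear-probing estimates (the analysis Knuth
gives) put the expected number of probes at load factor a at roughly

  C_hit  ~ (1/2) * (1 + 1/(1 - a))
  C_miss ~ (1/2) * (1 + 1/(1 - a)^2)

so an unsuccessful probe sequence averages about 2.5 probes at a = 0.5,
8.5 at a = 0.75, and 50.5 at a = 0.9, which is the cliff past 75% load
that would motivate such a check.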

It may be better to instead track the duplication factor (or an approximation thereof). That way, even if we end up full this time, the caller can ask for the estimated duplication factor and get it right next time with high probability. This will probably be enough to avoid any worry about an excessive number of retries (with the current 10x duplication factor assumption, we might need up to 4 retries with doubling like you have it now; with an estimate, we could make it max of 2 with high probability).
Another possibility is

Also, I would like to see a comment justifying why it is deterministic whether or not IsTableFull is set to true. Note that if it is nondeterministic then LLD's output will be nondeterministic, so it is important to get this right.

ELF/Concurrent.h
34 ↗(On Diff #79414)

Using how many cores?

57 ↗(On Diff #79414)

Please mention that this is crucially dependent on the linear probing strategy. Also, I would add a scary comment on the loop that does the probing saying that it *must* be kept as linear probing.

ruiu added inline comments.Nov 30 2016, 12:55 PM
ELF/Concurrent.cpp
24 ↗(On Diff #79414)

Because my target load factor is 0.5. I chose this number because

  • I don't want to exceed 0.75, at which point an open-addressing hash
    table gets much slower, and
  • even if the table can contain all given strings, a long streak of
    occupied buckets would make the hash table slower because of std::sort.

In order to tune this formula, I'd need to run it against various programs,
but that's not that important at the moment. I'll do that later.

26 ↗(On Diff #79414)

Changed to use std::vector.

53 ↗(On Diff #79414)

I don't have precise reasoning for these; I can only give my gut feeling. There should be room to optimize this, but we can do that later.

  • Why is it 0.75 and not 0.5?

I think a load factor of 0.5 is too early to give up. When we give up, we need to create a new table and restart, so we should continue using the same table until it approaches a much higher load. 0.75 seems like a good cutoff because beyond that the table gets really slow.

  • Shouldn't we keep track of the duplication factor?

We would if we could. But currently we stop adding items to the hash table once it gets "full", so after that point we don't know whether the remaining strings are duplicates or not.

I added a comment about the load factor and simplified the code for isFull(). I hope this makes things clearer.

73 ↗(On Diff #79414)

Yeah, I was tempted to say that this is a benign race, but according to the experts there is no such thing as a "benign race". Fixed.
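
The usual shape of such a fix, sketched here with the IsTableFull flag
mentioned earlier in the review (a general illustration, not the patch's
actual code):

```cpp
#include <atomic>

// A plain `bool IsTableFull` written concurrently by multiple threads is a
// data race (undefined behavior), even if every writer stores `true`.
// Making the flag atomic gives the same logic well-defined semantics.
std::atomic<bool> IsTableFull{false};

void markFull() { IsTableFull.store(true, std::memory_order_relaxed); }
bool isFull() { return IsTableFull.load(std::memory_order_relaxed); }
```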

ELF/OutputSections.cpp
772

I added code here to get a better estimate. We now keep track of an approximate number N of successfully inserted items. If N < NumPieces, only a fraction N / NumPieces of the pieces were inserted, so we need to enlarge the table by the inverse, namely NumPieces / N.
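
A sketch of what that retry logic can look like, with hypothetical names
(insertAllPieces, isFull, Pieces, NumPieces, EstimatedNumStrings) standing
in for the patch's actual interfaces:

```cpp
// Grow the table by the observed under-sizing factor and redo from scratch;
// resizing a live concurrent table is avoided entirely.
size_t NumBuckets = EstimatedNumStrings * 2; // initial guess, load factor 0.5
for (;;) {
  FixedConcurrentTable Table(NumBuckets);
  size_t N = insertAllPieces(Table, Pieces); // approx. successful inserts
  if (!Table.isFull())
    break; // everything fit; proceed to sorting and offset assignment
  // Only about N of NumPieces pieces fit, so the table was undersized by
  // roughly NumPieces / N; grow by that factor (and at least 2x) and retry.
  NumBuckets = std::max(NumBuckets * 2,
                        NumBuckets * NumPieces / std::max<size_t>(N, 1));
}
```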

ruiu updated this revision to Diff 79799.Nov 30 2016, 12:55 PM
  • Address review comments.
silvas added inline comments.Nov 30 2016, 9:37 PM
ELF/OutputSections.cpp
763

Where is the accompanying test?
Also, please make the name more specific.

ruiu updated this revision to Diff 80209.Dec 4 2016, 9:58 AM
ruiu marked 8 inline comments as done.
  • Address review comments.
silvas accepted this revision.Dec 4 2016, 2:24 PM
silvas edited edge metadata.

LGTM. Thanks for working on this!

This revision is now accepted and ready to land.Dec 4 2016, 2:24 PM
ruiu added a comment.Dec 5 2016, 10:59 PM

I'm struggling to improve the single-core performance of this patch. It scales well, but its single-core performance is poor. The table below shows the time to link clang with debug info (in seconds). As you can see, you need at least 4 cores to benefit from this patch.

  # of cores   Before   After    Delta
           1   13.462   17.048   +21.03%
           2    9.766   10.902   +10.42%
           4    7.697    6.935   -10.98%
           8    6.888    5.674   -21.39%
          12    7.073    5.812   -21.69%
          16    7.066    5.569   -26.88%
          20    6.846    5.226   -30.99%

I tried to optimize it, but because it fundamentally does more work than the simple hash table approach, it is almost impossible to compete with the original algorithm (that said, I think this is still too slow).

We cannot make the linker use this algorithm only when it detects 4 or more cores, because the choice of algorithm affects the layout of mergeable output sections, and we want deterministic output for the same input regardless of how many processors a computer has.

I've started thinking that the second, sharded algorithm may be better than this one because, even though it doesn't scale as well as this algorithm, its single-core performance is not that bad. I'll update the patch with performance numbers.

I'm sorry for the back-and-forth.

silvas closed this revision.Mar 25 2020, 6:33 PM

We landed a different version of this, IIRC. Closing diff.