This is an archive of the discontinued LLVM Phabricator instance.

Merge strings using concurrent hash map (3rd try!)
ClosedPublic

Authored by ruiu on Nov 27 2016, 4:32 PM.


Details

Reviewers

silvas
• espindola
jfb

Summary

Here is yet another implementation of the string merging algorithm.
It is faster than the previous two (https://reviews.llvm.org/D27146,
https://reviews.llvm.org/D27152).

ParallelStringTableBuilder, implemented in this patch, is a concurrent
hash table specialized for string table creation. It doesn't support
resizing, and you cannot do anything other than insert strings into
the builder and write the strings out to a buffer. By limiting the use
case, a concurrent hash table can be implemented fairly easily. (In
general it is extremely hard.)

This algorithm creates a string table that is optimized in terms of size,
and the output is deterministic.

The internal hash table is an open-addressing hash table, so conflicts
are resolved by using the next empty bucket. That brings in nondeterminism.
If two threads try to claim the same bucket, only one succeeds, and
the other gets the next empty one. So the bucket order is not deterministic.

To fix the problem, we sort buckets after inserting all keys.
We don't need to sort the entire hash table as one unit. Instead,
we sort buckets for each streak of claimed buckets.
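
The per-streak sorting idea can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions, not the patch's code: `Table`, `toyHash`, `insertLinear`, `sortStreaks` and `build` are hypothetical names, an empty string stands for an unclaimed bucket, and different insertion orders stand in for different thread interleavings. With linear probing, every key ends up in the run of claimed buckets that contains its hash position, so sorting each run yields the same table for any insertion order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy model: an empty string marks an unclaimed bucket. The table size
// must be a power of two and larger than the number of distinct keys.
using Table = std::vector<std::string>;

// Hypothetical stand-in hash (FNV-1a style), used only for reproducibility.
static size_t toyHash(const std::string &S) {
  std::uint64_t H = 0xcbf29ce484222325ULL;
  for (unsigned char C : S) {
    H ^= C;
    H *= 0x100000001b3ULL;
  }
  return static_cast<size_t>(H);
}

// Insert with linear probing; duplicate keys are ignored.
static void insertLinear(Table &T, const std::string &Key) {
  size_t Mask = T.size() - 1;
  for (size_t I = toyHash(Key) & Mask;; I = (I + 1) & Mask) {
    if (T[I].empty()) {
      T[I] = Key;
      return;
    }
    if (T[I] == Key)
      return;
  }
}

// Sort each maximal run of claimed buckets. Rotating the table so that it
// starts at an empty bucket makes a wrapped-around run contiguous.
static void sortStreaks(Table &T) {
  auto Empty = std::find(T.begin(), T.end(), std::string());
  assert(Empty != T.end() && "table must not be completely full");
  size_t Shift = Empty - T.begin();
  std::rotate(T.begin(), T.begin() + Shift, T.end());
  for (size_t I = 0; I < T.size();) {
    if (T[I].empty()) {
      ++I;
      continue;
    }
    size_t End = I;
    while (End < T.size() && !T[End].empty())
      ++End;
    std::sort(T.begin() + I, T.begin() + End);
    I = End;
  }
  // Undo the rotation so that buckets return to their original positions.
  std::rotate(T.begin(), T.begin() + (T.size() - Shift), T.end());
}

// Build a table from Keys in the given order, then make it deterministic.
static Table build(const std::vector<std::string> &Keys, size_t NumBuckets) {
  Table T(NumBuckets);
  for (const std::string &K : Keys)
    insertLinear(T, K);
  sortStreaks(T);
  return T;
}
```

Calling `build` with the same keys in two different orders should give tables that compare equal, which is the property the patch relies on.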

Here are the performance numbers. This is about the same as the probabilistic
algorithm (5.227 seconds) and better than the sharded hash table algorithm
(5.666 seconds).

Before:

   36427.671361 task-clock (msec)         #    5.477 CPUs utilized            ( +-  1.34% )
        158,095 context-switches          #    0.004 M/sec                    ( +-  0.27% )
          6,165 cpu-migrations            #    0.169 K/sec                    ( +- 21.57% )
      2,365,415 page-faults               #    0.065 M/sec                    ( +-  0.18% )
100,831,590,020 cycles                    #    2.768 GHz                      ( +-  1.32% )
 81,880,778,356 stalled-cycles-frontend   #   81.21% frontend cycles idle     ( +-  1.55% )
<not supported> stalled-cycles-backend
 45,993,420,294 instructions              #    0.46  insns per cycle
                                          #    1.78  stalled cycles per insn  ( +-  0.17% )
  8,913,176,489 branches                  #  244.681 M/sec                    ( +-  0.28% )
    148,952,459 branch-misses             #    1.67% of all branches          ( +-  0.10% )

    6.651371241 seconds time elapsed                                          ( +-  0.80% )

After:

   46385.337835 task-clock (msec)         #    8.869 CPUs utilized            ( +-  1.14% )
        170,016 context-switches          #    0.004 M/sec                    ( +-  0.39% )
          7,903 cpu-migrations            #    0.170 K/sec                    ( +- 19.36% )
      2,302,650 page-faults               #    0.050 M/sec                    ( +-  0.08% )
128,744,691,817 cycles                    #    2.776 GHz                      ( +-  1.13% )
109,140,318,510 stalled-cycles-frontend   #   84.77% frontend cycles idle     ( +-  1.23% )
<not supported> stalled-cycles-backend
 46,600,275,432 instructions              #    0.36  insns per cycle
                                          #    2.34  stalled cycles per insn  ( +-  0.65% )
  8,953,846,757 branches                  #  193.032 M/sec                    ( +-  1.04% )
    150,976,047 branch-misses             #    1.69% of all branches          ( +-  0.19% )

    5.230174248 seconds time elapsed                                          ( +-  0.69% )

Diff Detail

Build Status

Buildable 1848
Build 1848: arc lint + arc unit

Event Timeline

ruiu updated this revision to Diff 79364.Nov 27 2016, 4:32 PM

ruiu retitled this revision from to Merge strings using concurrent hash map (3rd try!).

ruiu updated this object.

ruiu added a reviewer: silvas.

ruiu added a subscriber: llvm-commits.

After a whole day of nonstop hacking, I somehow managed to get deterministic
output from the concurrent hash table.

The new number is 5.230 seconds. This is slightly slower than the previous
nondeterministic implementation (5.136 seconds), but that is not bad at all.
That is about the same as the probabilistic algorithm (5.227 seconds) and
faster than the sharded hash table algorithm (5.666 seconds).

Taking the fact that this produces deterministic, minimal output into account,
this is definitely the best algorithm so far.

The downside is that the implementation is now a bit tricky. I don't think
this is hard to read, but it's undeniably more complicated than the original,
single-threaded implementation. I believe that's acceptable in this case
though.

ruiu updated this object.Nov 27 2016, 8:39 PM

Wow, great work. I think I've convinced myself that this will be deterministic. The crucial observation is that with linear probing, the "streaks" (strings of consecutive occupied buckets) are always the same regardless of insertion order.

How much extra performance is there for sorting the streaks individually like you do in this patch? We could use a single call to std::remove_if to coalesce all the non-empty buckets and a single call to std::sort to get a deterministic order, which would be simpler. If that isn't too much of a performance cost, it would be simpler to do that.
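
For comparison, the simpler alternative suggested above might look roughly like the following sketch. It assumes a toy bucket array of `std::string` in which an empty string marks an unclaimed bucket; `compactAndSort` is a hypothetical name and is not part of the patch. Note that the result is a packed, sorted array that can no longer be probed as a hash table.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Drop all empty buckets with one std::remove_if pass, then impose a
// deterministic order with a single std::sort over the survivors.
static std::vector<std::string> compactAndSort(std::vector<std::string> Buckets) {
  Buckets.erase(std::remove_if(Buckets.begin(), Buckets.end(),
                               [](const std::string &S) { return S.empty(); }),
                Buckets.end());
  std::sort(Buckets.begin(), Buckets.end());
  return Buckets;
}
```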

ELF/OutputSections.cpp
473	This is extremely troubling as it implies that we need to guarantee that we have chosen the size large enough or else LLD will have undefined behavior.
492	This property depends critically on the linear probing. This would not hold if we used quadratic probing. It would be good to mention that.
531	Won't this end up smaller for 32-bit hosts?
577	This must be unreachable, or it must somehow signal this so that the calling code can retry the string table building with a bigger size. We can't have LLD fail to link because a user's object files don't have enough duplicate strings.

I will address review comments tomorrow, but here is a breakdown. When merging 29,313,742 strings, we spent

185,729 us to insert them into the concurrent hash table,
70,814 us to sort runs of claimed buckets,
66,560 us to assign string table offsets to buckets, and
124,903 us to update the OutputOff member for all SectionPieces.

I think the algorithm is correct, but this patch is not ready for commit yet. As you said, we need to handle the case where the hash table becomes full. (My rough idea is that when it becomes full, we discard the entire hash table and redo from scratch with a larger hash table. Resizing an in-use concurrent hash table is extremely hard.)
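
The "discard and redo" idea could be sketched as below. This is a single-threaded illustration under assumptions, not the code that was later added to the patch: `FixedStringSet` and `buildWithRetry` are hypothetical names, keys are assumed to be non-empty, the bucket count is assumed to be a power of two, and the 3/4 cutoff and the doubling policy are choices made only for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical fixed-capacity string set. It never resizes; once it is
// considered full, further inserts are ignored and the caller must rebuild.
class FixedStringSet {
public:
  explicit FixedStringSet(size_t NumBuckets) : Buckets(NumBuckets) {}

  void insert(const std::string &S) {
    if (Full)
      return;
    size_t Mask = Buckets.size() - 1;
    size_t I = std::hash<std::string>()(S) & Mask;
    for (size_t N = 0; N < Buckets.size(); ++N, I = (I + 1) & Mask) {
      if (Buckets[I].empty()) {
        Buckets[I] = S;
        // Treat a load factor above 3/4 as "full" and bail out early.
        if (++Used * 4 > Buckets.size() * 3)
          Full = true;
        return;
      }
      if (Buckets[I] == S)
        return;
    }
    Full = true; // Every bucket was probed without finding room.
  }

  bool isFull() const { return Full; }
  size_t size() const { return Used; }
  size_t numBuckets() const { return Buckets.size(); }

private:
  std::vector<std::string> Buckets;
  size_t Used = 0;
  bool Full = false;
};

// Throw the table away and start over with twice as many buckets until
// every key fits below the cutoff.
static FixedStringSet buildWithRetry(const std::vector<std::string> &Keys,
                                     size_t NumBuckets) {
  for (;;) {
    FixedStringSet T(NumBuckets);
    for (const std::string &K : Keys) {
      T.insert(K);
      if (T.isFull())
        break;
    }
    if (!T.isFull())
      return T;
    NumBuckets *= 2;
  }
}
```

Restarting from scratch keeps the table itself simple, at the price of repeating all inserts whenever the initial size estimate was too small.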

Move the class to Concurrent.{h,cpp}
Handle the case when the table becomes full

Herald added a subscriber: mgorny. Nov 28 2016, 9:16 AM

As to whether we should do std::remove_if and then call std::sort only once or not, I think we shouldn't do that. Sorting is O(n log n), so you don't want to make n larger by merging streaks that could be sorted independently.

I don't think that it makes sense to put this in a file with a generic name like "Concurrent", as this is a quite specialized data structure that depends on specific LLD types. Maybe just call it ConcurrentStringTableBuilder.h?
I'm a bit worried about the layering too, does this class introduce any circular dependencies?

ELF/Concurrent.cpp
25	Why double it? Also, you have a similar std::max calculation in MergeOutputSection<ELFT>::finalize. Do you need it in both places?
27	Can you use a std::unique_ptr<EntryTy[]> to manage this?
74	I think this technically needs to be atomic to avoid races.
ELF/OutputSections.cpp
629	I'm very concerned that there may be user scenarios where we end up almost always needing multiple trips through this loop (e.g., maybe "most" non-debug builds end up with only 2 duplicates on average?). How will we determine if this is the case? Otherwise, this "optimization" may backfire and users in the wild will get slower links; we need to have some feedback loop to correct this if it is the case, or be really confident that this optimization will always speed things up. I would propose that we do a run of Poudriere or Debian or Gentoo with this patch applied and treat the resize case as a fatal error (and ideally the error message would also give the number of resizes). That way we can get an idea of how common this is.

Sorry for the delay.

ELF/Concurrent.cpp
54	I assume the idea here is to identify as early as possible whether we underestimated the number of buckets. I like this idea. Do you have any justification for the choice of numbers? Why is this better than just waiting for the table to become full? How much better? Should the check be 50%? (Knuth Vol 3 Sec 6.4 provides a good analysis for linearly probed open-addressed hash tables; the lookup costs will skyrocket after 75% load; is that your motivation? It would be good to have a comment) It may be better to instead track the duplication factor (or an approximation thereof). That way, even if we end up full this time, the caller can ask for the estimated duplication factor and get it right next time with high probability. This will probably be enough to avoid any worry about an excessive number of retries (with the current 10x duplication factor assumption, we might need up to 4 retries with doubling like you have it now; with an estimate, we could make it max of 2 with high probability). Another possibility is Also, I would like to see a comment justifying why it is deterministic whether or not IsTableFull is set to true. Note that if it is nondeterministic then LLD's output will be nondeterministic, so it is important to get this right.
ELF/Concurrent.h
35	Using how many cores?
58	Please mention that this is crucially dependent on the linear probing strategy. Also, I would add a scary comment on the loop that does the probing saying that it must be kept as linear probing.
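
The load-factor question can be made concrete with the classic approximations for linear probing from the analysis cited above (Knuth, Vol. 3, Sec. 6.4): about (1 + 1/(1 - a)) / 2 probes for a successful lookup and (1 + 1/(1 - a)^2) / 2 for an unsuccessful one (or an insertion) at load factor a. The sketch below just evaluates these formulas; `probesHit` and `probesMiss` are hypothetical helper names, and the formulas assume uniformly distributed hashes, so they are an estimate and not a measurement of this patch.

```cpp
#include <cassert>
#include <cmath>

// Approximate expected probes for a successful lookup at load factor Alpha.
static double probesHit(double Alpha) { return 0.5 * (1.0 + 1.0 / (1.0 - Alpha)); }

// Approximate expected probes for an unsuccessful lookup or an insertion.
static double probesMiss(double Alpha) {
  double D = 1.0 - Alpha;
  return 0.5 * (1.0 + 1.0 / (D * D));
}
```

Under these approximations an insertion costs about 2.5 probes at load 0.5, 8.5 probes at 0.75 and 50.5 probes at 0.9, which is one way to motivate a cutoff around 3/4.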

ruiu added inline comments.Nov 30 2016, 12:55 PM

ELF/Concurrent.cpp
25	Because my target load factor is 0.5. I chose this number because I don't want to exceed 0.75, at which point an open-addressing hash table gets much slower. Even if it can contain all given strings, a long streak of occupied buckets would make the hash table slower because of std::sort. In order to tune this formula, I need to run this against various programs, but that's not that important at this moment. I'll do that later.
27	Changed to use std::vector.
54	I do not have precise reasoning for these questions; I can only give my gut feeling. There should be room to optimize it, but we can do that later. Why is it 0.75 and not 0.5? I think a load factor of 0.5 is too early to give up. When we give up, we need to create a new table and restart. We should continue using the same table until it approaches a much higher load. 0.75 seems like a good cutoff because beyond that it gets really slow. Should we keep track of the duplication factor? We should if we could. But currently we stop adding items to the hash table when it gets "full", so once it becomes full, we don't know whether the remaining strings are duplicates or not. I added a comment about the load factor, and simplified the code for isFull(). I hope this makes things clear.
74	Yeah, I was tempted to say that this is a benign race, but according to the experts there is no such thing as a "benign race". Fixed.
ELF/OutputSections.cpp
629	I added code here to get a better estimate. Now we keep track of an approximate number N of successfully inserted items. If N < NumPieces, only N / NumPieces of the pieces were inserted, so we need to enlarge the table by the inverse, namely NumPieces / N.
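
The scaling rule described in the last reply amounts to simple arithmetic. The sketch below is an assumption about how such a rule could be written, not the code in OutputSections.cpp (which is not shown here): `scaleEstimate` is a hypothetical helper, and the fallback to doubling when nothing was inserted is a choice made only for this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// If only NumInserted of NumPieces items fit before the table was marked
// full, grow the size estimate by the inverse ratio NumPieces / NumInserted.
static size_t scaleEstimate(size_t OldEstimate, size_t NumPieces,
                            size_t NumInserted) {
  if (NumInserted >= NumPieces)
    return OldEstimate; // Everything fit; no enlargement is needed.
  if (NumInserted == 0)
    return OldEstimate * 2; // No information available; fall back to doubling.
  double Factor = static_cast<double>(NumPieces) / NumInserted;
  return static_cast<size_t>(std::ceil(OldEstimate * Factor));
}
```

For example, if the estimate was 1000 entries and only 1000 of 3000 pieces were inserted before bail-out, the next attempt would use an estimate of 3000.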

Address review comments.

silvas added inline comments.Nov 30 2016, 9:37 PM

ELF/OutputSections.cpp
592	Where is the accompanying test? Also, please make the name more specific.

Address review comments.

LGTM. Thanks for working on this!

This revision is now accepted and ready to land.Dec 4 2016, 2:24 PM

I'm struggling to improve the single-core performance of this patch. It scales well, but its single-core performance sucks. This table shows the link time of clang with debug info (the unit is seconds). As you can see, you need at least 4 cores to take advantage of this patch.

# of cores   Before    After   Change (relative to After)
         1   13.462   17.048   +21.03%
         2    9.766   10.902   +10.42%
         4    7.697    6.935   -10.98%
         8    6.888    5.674   -21.39%
        12    7.073    5.812   -21.69%
        16    7.066    5.569   -26.88%
        20    6.846    5.226   -30.99%

I tried to optimize it, but because it fundamentally does more work than the simple hash table approach, it is almost impossible to compete with the original algorithm (that said, I still think this is too slow).

We cannot make the linker use this algorithm only when it detects 4 or more cores, because the choice of algorithm affects the layout of mergeable output sections. We want to get deterministic output for the same input regardless of how many processors are available on a computer.

I started thinking that the second, sharded algorithm may be better than this one because, even though it doesn't scale like this algorithm, its single-core performance is not that bad. I'll update the patch with performance numbers.

I'm sorry for the back-and-forth.

ruiu mentioned this in D27152: Merge strings using sharded hash tables..Dec 5 2016, 11:21 PM

We landed a different version of this, IIRC. Closing diff.

Herald added a reviewer: • espindola. Mar 25 2020, 6:33 PM

Herald added a reviewer: jfb.

Herald added subscribers: jfb, mgrang, MaskRay, emaste.

Revision Contents

Path

Size

ELF/

1 line

142 lines

185 lines

9 lines

7 lines

81 lines

test/

ELF/

merge-strings-concurrent.s

25 lines

Diff 80209

ELF/CMakeLists.txt

	set(LLVM_TARGET_DEFINITIONS Options.td)			set(LLVM_TARGET_DEFINITIONS Options.td)
	tablegen(LLVM Options.inc -gen-opt-parser-defs)			tablegen(LLVM Options.inc -gen-opt-parser-defs)
	add_public_tablegen_target(ELFOptionsTableGen)			add_public_tablegen_target(ELFOptionsTableGen)

	add_lld_library(lldELF			add_lld_library(lldELF
				Concurrent.cpp
	Driver.cpp			Driver.cpp
	DriverUtils.cpp			DriverUtils.cpp
	EhFrame.cpp			EhFrame.cpp
	Error.cpp			Error.cpp
	GdbIndex.cpp			GdbIndex.cpp
	ICF.cpp			ICF.cpp
	InputFiles.cpp			InputFiles.cpp
	InputSection.cpp			InputSection.cpp
	▲ Show 20 Lines • Show All 45 Lines • Show Last 20 Lines

ELF/Concurrent.h

This file was added.

				//===- Concurrent.h -------------------------------------------------------===//
				//
				// The LLVM Linker
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLD_ELF_CONCURRENT_H
				#define LLD_ELF_CONCURRENT_H

				#include "lld/Core/LLVM.h"
				#include <atomic>

				namespace llvm {
				class CachedHashStringRef;
				}

				namespace lld {
				namespace elf {
				struct SectionPiece;

				// This is a concurrent lock-free string table builder. Internally
				// it uses atomic variables to keep inserted strings and associated
				// SectionPieces.
				//
				// The reason why we have a special-purpose concurrent hash table in
				// LLD is because we need to process a very large number of mergeable
				// strings. For example, in the final build of clang with debug info,
				// thousands of input sections containing 30 million mergeable strings
				// in total are fed to the linker. It takes a few seconds to merge
				// them in single thread. This concurrent hash table can uniquify them
				// in a few hundred milliseconds using 40 cores.
				//
				// The internal hash table is an open-addressing one. It doesn't
				// support resizing. Once it becomes full, you need to redo with
				// a larger fresh string table builder.
				//
				// Just like DenseMap, keys and values are directly stored to buckets
				// as pairs. Originally, keys and values are all null. Keys are
				// pointers to strings.
				//
				// SectionPieces are managed using singly linked list. If a bucket
				// have a value, it has a pointer to a SectionPiece. Other
				// SectionPieces having the same string key are chained using `Next`
				// pointer of SectionPiece.
				//
				// Once you added all strings, you need to call finalize() to fix
				// string table contents. When finalize() is called, hash table
				// contents is nondeterministic because no one knows which thread
				// have claimed earlier buckets in the hash table.
				//
				// Thus, the next step is to make it deterministic. For each streak of
				// occupied buckets, we sort that. Recall that this hash table
				// resolves conflicts with linear probing. No matter what order
				// multiple threads claim buckets, claimed buckets are claimed, and
				// unclaimed buckets remain unclaimed. Therefore, by sorting buckets
				// for each streak, we can make the entire hash table deterministic.
				//
				// Finally, we assign offsets to all SectionPieces associated with
				// this hash table.
				class ConcurrentStringTableBuilder {
				public:
				ConcurrentStringTableBuilder(size_t EstimatedNumEntries, size_t Alignment);

				void insert(SectionPiece &Piece, llvm::CachedHashStringRef Str);
				void finalize();
				bool isFull();
				void writeTo(uint8_t *Buf);
				size_t size() const { return StringTableSize; }

				private:
				// Bucket entry type.
				struct EntryTy {
				// std::sort needs these functions.
				EntryTy(const EntryTy &Other) { *this = Other; }

				EntryTy &operator=(const EntryTy &Other) {
				memcpy(this, &Other, sizeof(EntryTy));
				return *this;
				}

				EntryTy() = default;

				// Define a total order for string keys. The detail is not important,
				// but it needs to be defined that we can get deterministic output
				// from the hash table.
				bool operator<(const EntryTy &Other) const {
				size_t SizeA = Size.load();
				size_t SizeB = Other.Size.load();
				if (SizeA != SizeB)
				return SizeA < SizeB;
				return memcmp(Str.load(), Other.Str.load(), SizeA) < 0;
				};

				// String key information. String key is guaranteed to be unique
				// in a hash table.
				std::atomic<const char *> Str = {nullptr};
				std::atomic<uint32_t> Size = {0};

				// Offset in the string table. Filled by finalize().
				uint32_t Offset = 0;

				// SectionPieces having the same string contents are chained
				// using pointers. This pointer points to the first node.
				std::atomic<SectionPiece *> Piece = {nullptr};
				};

				// We could allocate tens of millions of objects of this type,
				// so we want to keep it small.
				static_assert(sizeof(EntryTy) <= 24, "EntryTy is too big");

				size_t align2(size_t Val) { return (Val + Alignment - 1) & ~(Alignment - 1); }
				void append(EntryTy &Bucket, SectionPiece &Piece);
				void sortBuckets();
				void sortWrapAround();

				// All strings are aligned to this boundary.
				size_t Alignment;

				// Filled by finalize().
				size_t NumEntries = 0;
				size_t StringTableSize = 0;

				// The number of allocated buckets and buckets.
				size_t NumBuckets;
				std::vector<EntryTy> Buckets;

				// The counter to track number of inserted strings.
				// Note that the number is approximate.
				std::atomic<size_t> Counter = {0};

				// We do not support table resizing. If it becomes almost full,
				// we should bail out and redo with a larger table. This bool
				// value keeps track of that.
				std::atomic<bool> IsTableFull = {false};
				};
				}
				}

				#endif

ELF/Concurrent.cpp

This file was added.

				//===- Concurrent.cpp -----------------------------------------------------===//
				//
				// The LLVM Linker
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//

				#include "Concurrent.h"
				#include "InputSection.h"
				#include "Threads.h"

				#include "llvm/ADT/CachedHashString.h"

				using namespace lld;
				using namespace lld::elf;
				using namespace llvm;

				// Our target load factor is 0.5.
				ConcurrentStringTableBuilder::ConcurrentStringTableBuilder(
				size_t EstimatedNumEntries, size_t Alignment)
				: Alignment(Alignment), NumBuckets(PowerOf2Ceil(EstimatedNumEntries * 2)),
				Buckets(NumBuckets) {}

				// Inserts a given section piece to the string table.
				void ConcurrentStringTableBuilder::insert(SectionPiece &Piece,
				CachedHashStringRef Str) {
				if (IsTableFull)
				return;
				assert(Str.size() != 0);

				size_t Start = Str.hash() & (NumBuckets - 1);

				for (size_t I = Start; I != Start - 1; I = (I + 1) & (NumBuckets - 1)) {
				EntryTy &Bucket = Buckets[I];

				// If the current bucket is empty, claim it.
				const char *Null = nullptr;
				if (Bucket.Str.compare_exchange_strong(Null, Str.val().data())) {
				Bucket.Size.store(Str.size());
				append(Bucket, Piece);

				// Update the counter for inserted items. We do this only in
				// 1/256 chance to avoid contention, so the counter is an
				// estimate but deterministic because it does not depend on any
				// randomness nor have race condition.
				if (Str.hash() % 256 != 0)
				return;
				size_t Cnt = Counter.fetch_add(256);

				// If 3/4 of the buckets are occupied, we mark this table as full.
				// Open-addressing table gets exponentially slower when it
				// approaches to load factor 1.0, and 3/4 is a reasonable cutoff.
				// If we have more numbers of items, we should bail out early
				// and redo with a larger table.
				if (Cnt * 4 > NumBuckets * 3)
				IsTableFull = true;
				return;
				}

				// The current bucket contains some string. Its size might not be
				// written by other thread yet, so try loading until it becomes
				// observable.
				const char *OldStr = Bucket.Str.load();
				uint64_t OldSize = 0;
				while (OldSize == 0)
				OldSize = Bucket.Size.load();

				// If the current bucket contains the key that we are looking for,
				// append Piece to the bucket. Otherwise, go to next bucket.
				if (Str.val() == StringRef(OldStr, OldSize)) {
				append(Bucket, Piece);
				return;
				}
				}
				IsTableFull = true;
				}

				void ConcurrentStringTableBuilder::finalize() {
				if (IsTableFull)
				return;

				// Sort buckets to make the hash table contents deterministic.
				sortBuckets();
				sortWrapAround();

				// Set an offset in the string table for each bucket.
				size_t Off = 0;
				for (size_t I = 0; I < NumBuckets; ++I) {
				if (size_t Size = Buckets[I].Size.load()) {
				Off = align2(Off);
				Buckets[I].Offset = Off;
				Off += Buckets[I].Size.load();
				++NumEntries;
				}
				}
				StringTableSize = Off;

				// Update SectionPieces' offsets.
				forLoop(0, NumBuckets, [&](size_t I) {
				SectionPiece *Cur = Buckets[I].Piece.load();
				while (Cur) {
				// OutputOff and Next are a union, so we need to save Next before
				// writing to OutputOff.
				SectionPiece *Next = Cur->Next;
				Cur->OutputOff = Buckets[I].Offset;
				Cur = Next;
				}
				});
				}

				bool ConcurrentStringTableBuilder::isFull() { return IsTableFull; }

				void ConcurrentStringTableBuilder::writeTo(uint8_t *Buf) {
				assert(!isFull());

				forLoop(size_t(0), NumBuckets, [&](size_t I) {
				if (const char *Str = Buckets[I].Str.load()) {
				size_t Size = Buckets[I].Size.load();
				memcpy(Buf + Buckets[I].Offset, Str, Size);
				}
				});
				}

				// Atomically append Piece to Bucket.
				void ConcurrentStringTableBuilder::append(EntryTy &Bucket,
				SectionPiece &Piece) {
				for (;;) {
				Piece.Next = Bucket.Piece.load();
				if (Bucket.Piece.compare_exchange_weak(Piece.Next, &Piece))
				return;
				}
				}

				// Find runs of occupied buckets and sort them. This assumes we are
				// using linear probing to find an unused bucket.
				void ConcurrentStringTableBuilder::sortBuckets() {
				for (size_t I = 0; I < NumBuckets;) {
				if (Buckets[I].Size.load() == 0) {
				++I;
				continue;
				}
				size_t Begin = I;
				size_t End = I + 1;

				while (End < NumBuckets && Buckets[End].Size.load())
				++End;
				std::sort(Buckets.begin() + Begin, Buckets.begin() + End);
				I = End;
				}
				}

				// After the end of the buckets, it wraps around to the beginning,
				// so we need to sort them as one unit. sortBuckets() didn't handle
				// this corner case.
				void ConcurrentStringTableBuilder::sortWrapAround() {
				if (Buckets[0].Size.load() == 0 || Buckets[NumBuckets - 1].Size.load() == 0)
				return;

				// Copies a wrapped-around streak to a vector, sort the vector,
				// and then write them back.
				size_t First = 1;
				while (First < NumBuckets && Buckets[First].Size.load())
				++First;
				if (First == NumBuckets) {
				IsTableFull = true;
				return;
				}

				size_t Last = NumBuckets - 1;
				while (Buckets[Last - 1].Size.load())
				--Last;

				std::vector<EntryTy> V;
				auto Begin = Buckets.begin();
				V.insert(V.end(), Begin, Begin + First);
				V.insert(V.end(), Begin + Last, Begin + NumBuckets);

				std::sort(V.begin(), V.end());

				size_t LastSize = NumBuckets - Last;
				std::copy(V.begin(), V.begin() + LastSize, Begin + Last);
				std::copy(V.begin() + LastSize, V.end(), Begin);
				}

ELF/InputSection.h

	Show First 20 Lines • Show All 154 Lines • ▼ Show 20 Lines
	};			};

	// SectionPiece represents a piece of splittable section contents.			// SectionPiece represents a piece of splittable section contents.
	// We allocate a lot of these and binary search on them. This means that they			// We allocate a lot of these and binary search on them. This means that they
	// have to be as compact as possible, which is why we don't store the size (can			// have to be as compact as possible, which is why we don't store the size (can
	// be found by looking at the next one) and put the hash in a side table.			// be found by looking at the next one) and put the hash in a side table.
	struct SectionPiece {			struct SectionPiece {
	SectionPiece(size_t Off, bool Live = false)			SectionPiece(size_t Off, bool Live = false)
	: InputOff(Off), OutputOff(-1), Live(Live || !Config->GcSections) {}			: InputOff(Off), Live(Live || !Config->GcSections), OutputOff(-1) {}

	size_t InputOff;			size_t InputOff : 8 * sizeof(ssize_t) - 1;
	ssize_t OutputOff : 8 * sizeof(ssize_t) - 1;
	size_t Live : 1;			size_t Live : 1;
				union {
				ssize_t OutputOff;
				SectionPiece *Next; // used by ConcurrentStringTableBuilder
				};
	};			};
	static_assert(sizeof(SectionPiece) == 2 * sizeof(size_t),			static_assert(sizeof(SectionPiece) == 2 * sizeof(size_t),
	"SectionPiece is too big");			"SectionPiece is too big");

	// This corresponds to a SHF_MERGE section of an input file.			// This corresponds to a SHF_MERGE section of an input file.
	template <class ELFT> class MergeInputSection : public InputSectionBase<ELFT> {			template <class ELFT> class MergeInputSection : public InputSectionBase<ELFT> {
	typedef typename ELFT::uint uintX_t;			typedef typename ELFT::uint uintX_t;
	typedef typename ELFT::Sym Elf_Sym;			typedef typename ELFT::Sym Elf_Sym;
	▲ Show 20 Lines • Show All 136 Lines • Show Last 20 Lines

ELF/OutputSections.h

Show All 15 Lines
#include "lld/Core/LLVM.h"		#include "lld/Core/LLVM.h"
#include "llvm/MC/StringTableBuilder.h"		#include "llvm/MC/StringTableBuilder.h"
#include "llvm/Object/ELF.h"		#include "llvm/Object/ELF.h"

namespace lld {		namespace lld {
namespace elf {		namespace elf {

class SymbolBody;		class SymbolBody;
		class ConcurrentStringTableBuilder;
struct EhSectionPiece;		struct EhSectionPiece;
template <class ELFT> class EhInputSection;		template <class ELFT> class EhInputSection;
template <class ELFT> class InputSection;		template <class ELFT> class InputSection;
template <class ELFT> class InputSectionBase;		template <class ELFT> class InputSectionBase;
template <class ELFT> class MergeInputSection;		template <class ELFT> class MergeInputSection;
template <class ELFT> class OutputSection;		template <class ELFT> class OutputSection;
template <class ELFT> class ObjectFile;		template <class ELFT> class ObjectFile;
template <class ELFT> class SharedFile;		template <class ELFT> class SharedFile;
public:
void finalize() override;		void finalize() override;
bool shouldTailMerge() const;		bool shouldTailMerge() const;
Kind getKind() const override { return Merge; }		Kind getKind() const override { return Merge; }
static bool classof(const OutputSectionBase *B) {		static bool classof(const OutputSectionBase *B) {
return B->getKind() == Merge;		return B->getKind() == Merge;
}		}

private:		private:
		void finalizeDefault();
void finalizeTailMerge();		void finalizeTailMerge();
void finalizeNoTailMerge();		void finalizeConcurrent(size_t NumPieces);

llvm::StringTableBuilder Builder;		llvm::StringTableBuilder Builder;
		std::unique_ptr<ConcurrentStringTableBuilder> ConcurrentBuilder;

std::vector<MergeInputSection<ELFT> *> Sections;		std::vector<MergeInputSection<ELFT> *> Sections;
		size_t StringAlignment;
};		};

struct CieRecord {		struct CieRecord {
EhSectionPiece *Piece = nullptr;		EhSectionPiece *Piece = nullptr;
std::vector<EhSectionPiece *> FdePieces;		std::vector<EhSectionPiece *> FdePieces;
};		};

// Output section for .eh_frame.		// Output section for .eh_frame.

ELF/OutputSections.cpp

//===- OutputSections.cpp -------------------------------------------------===//		//===- OutputSections.cpp -------------------------------------------------===//
//		//
// The LLVM Linker		// The LLVM Linker
//		//
// This file is distributed under the University of Illinois Open Source		// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.		// License. See LICENSE.TXT for details.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "OutputSections.h"		#include "OutputSections.h"
		#include "Concurrent.h"
#include "Config.h"		#include "Config.h"
#include "EhFrame.h"		#include "EhFrame.h"
#include "LinkerScript.h"		#include "LinkerScript.h"
#include "Memory.h"		#include "Memory.h"
#include "Strings.h"		#include "Strings.h"
#include "SymbolTable.h"		#include "SymbolTable.h"
#include "SyntheticSections.h"		#include "SyntheticSections.h"
#include "Target.h"		#include "Target.h"
if (In<ELFT>::EhFrameHdr) {
}		}
}		}
}		}

template <class ELFT>		template <class ELFT>
MergeOutputSection<ELFT>::MergeOutputSection(StringRef Name, uint32_t Type,		MergeOutputSection<ELFT>::MergeOutputSection(StringRef Name, uint32_t Type,
uintX_t Flags, uintX_t Alignment)		uintX_t Flags, uintX_t Alignment)
: OutputSectionBase(Name, Type, Flags),		: OutputSectionBase(Name, Type, Flags),
Builder(StringTableBuilder::RAW, Alignment) {}		Builder(StringTableBuilder::RAW, Alignment), StringAlignment(Alignment) {}

		silvas: This is extremely troubling as it implies that we need to guarantee that we have chosen the size large enough or else LLD will have undefined behavior.
template <class ELFT> void MergeOutputSection<ELFT>::writeTo(uint8_t *Buf) {		template <class ELFT> void MergeOutputSection<ELFT>::writeTo(uint8_t *Buf) {
		if (ConcurrentBuilder)
		ConcurrentBuilder->writeTo(Buf);
		else
Builder.write(Buf);		Builder.write(Buf);
}		}

template <class ELFT>		template <class ELFT>
void MergeOutputSection<ELFT>::addSection(InputSectionData *C) {		void MergeOutputSection<ELFT>::addSection(InputSectionData *C) {
auto *Sec = cast<MergeInputSection<ELFT>>(C);		auto *Sec = cast<MergeInputSection<ELFT>>(C);
Sec->OutSec = this;		Sec->OutSec = this;
this->updateAlignment(Sec->Alignment);		this->updateAlignment(Sec->Alignment);
this->Entsize = Sec->Entsize;		this->Entsize = Sec->Entsize;
Sections.push_back(Sec);		Sections.push_back(Sec);
}		}

template <class ELFT> bool MergeOutputSection<ELFT>::shouldTailMerge() const {		template <class ELFT> bool MergeOutputSection<ELFT>::shouldTailMerge() const {
return (this->Flags & SHF_STRINGS) && Config->Optimize >= 2;		return (this->Flags & SHF_STRINGS) && Config->Optimize >= 2;
}		}
		silvas: This property depends critically on the linear probing. This would not hold if we used quadratic probing. It would be good to mention that.
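
The dependence on linear probing can be illustrated with a single-threaded sketch. With linear probing, a key always lands in the same contiguous run ("streak") of claimed buckets as its home bucket, so sorting each streak removes any dependence on insertion order. Everything below is an assumption made for illustration: `ProbeTable`, `insert`, and `sortStreaks` are hypothetical names, the table is not concurrent, and it is not the ConcurrentStringTableBuilder from this patch, whose source is not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical fixed-size open-addressing set with linear probing.
// An empty string marks an unclaimed bucket, so keys must be non-empty.
struct ProbeTable {
  std::vector<std::string> Buckets;

  explicit ProbeTable(std::size_t N) : Buckets(N) { assert(N > 0); }

  // Inserts Key unless it is already present.
  // Returns false if the table is full (there is no resizing).
  bool insert(const std::string &Key) {
    std::size_t N = Buckets.size();
    std::size_t Home = std::hash<std::string>()(Key) % N;
    for (std::size_t I = 0; I != N; ++I) {
      std::string &B = Buckets[(Home + I) % N]; // linear probing
      if (B.empty()) {
        B = Key;
        return true;
      }
      if (B == Key)
        return true;
    }
    return false;
  }

  // Sorts the keys stored in the given bucket indices and writes them back.
  void sortRun(const std::vector<std::size_t> &Run) {
    std::vector<std::string> Keys;
    for (std::size_t Idx : Run)
      Keys.push_back(Buckets[Idx]);
    std::sort(Keys.begin(), Keys.end());
    for (std::size_t I = 0; I != Run.size(); ++I)
      Buckets[Run[I]] = Keys[I];
  }

  // Sorts every maximal streak of claimed buckets. The table is treated as
  // circular, so a streak that wraps past the end is sorted as one unit.
  void sortStreaks() {
    std::size_t N = Buckets.size();
    std::size_t Start = 0;
    while (Start != N && !Buckets[Start].empty())
      ++Start;
    if (Start == N) { // no empty bucket: the whole table is one streak
      std::sort(Buckets.begin(), Buckets.end());
      return;
    }
    std::vector<std::size_t> Run;
    for (std::size_t Step = 1; Step <= N; ++Step) {
      std::size_t Idx = (Start + Step) % N;
      if (!Buckets[Idx].empty()) {
        Run.push_back(Idx);
        continue;
      }
      sortRun(Run); // an empty bucket closes the current streak
      Run.clear();
    }
  }
};
```

After `sortStreaks()` a key may sit before its home bucket, so the table is no longer valid for lookups; the sketch assumes, as the summary describes, that the table is only written out afterwards. With quadratic probing a key's probe sequence is not contiguous, so the streak containing a key would no longer be determined by the set of claimed buckets alone.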

template <class ELFT> void MergeOutputSection<ELFT>::finalizeTailMerge() {		template <class ELFT> void MergeOutputSection<ELFT>::finalizeTailMerge() {
// Add all string pieces to the string table builder to create section		// Add all string pieces to the string table builder to create section
// contents.		// contents.
for (MergeInputSection<ELFT> *Sec : Sections)		for (MergeInputSection<ELFT> *Sec : Sections)
for (size_t I = 0, E = Sec->Pieces.size(); I != E; ++I)		for (size_t I = 0, E = Sec->Pieces.size(); I != E; ++I)
if (Sec->Pieces[I].Live)		if (Sec->Pieces[I].Live)
Builder.add(Sec->getData(I));		Builder.add(Sec->getData(I));

// Fix the string table content. After this, the contents will never change.		// Fix the string table content. After this, the contents will never change.
Builder.finalize();		Builder.finalize();
this->Size = Builder.getSize();		this->Size = Builder.getSize();

// finalize() fixed tail-optimized strings, so we can now get		// finalize() fixed tail-optimized strings, so we can now get
// offsets of strings. Get an offset for each string and save it		// offsets of strings. Get an offset for each string and save it
// to a corresponding StringPiece for easy access.		// to a corresponding StringPiece for easy access.
for (MergeInputSection<ELFT> *Sec : Sections)		for (MergeInputSection<ELFT> *Sec : Sections)
for (size_t I = 0, E = Sec->Pieces.size(); I != E; ++I)		for (size_t I = 0, E = Sec->Pieces.size(); I != E; ++I)
if (Sec->Pieces[I].Live)		if (Sec->Pieces[I].Live)
Sec->Pieces[I].OutputOff = Builder.getOffset(Sec->getData(I));		Sec->Pieces[I].OutputOff = Builder.getOffset(Sec->getData(I));
}		}

template <class ELFT> void MergeOutputSection<ELFT>::finalizeNoTailMerge() {		template <class ELFT> void MergeOutputSection<ELFT>::finalizeDefault() {
// Add all string pieces to the string table builder to create section		// Add all string pieces to the string table builder to create section
// contents. Because we are not tail-optimizing, offsets of strings are		// contents. Because we are not tail-optimizing, offsets of strings are
// fixed when they are added to the builder (string table builder contains		// fixed when they are added to the builder (string table builder contains
// a hash table from strings to offsets).		// a hash table from strings to offsets).
for (MergeInputSection<ELFT> *Sec : Sections)		for (MergeInputSection<ELFT> *Sec : Sections)
for (size_t I = 0, E = Sec->Pieces.size(); I != E; ++I)		for (size_t I = 0, E = Sec->Pieces.size(); I != E; ++I)
if (Sec->Pieces[I].Live)		if (Sec->Pieces[I].Live)
Sec->Pieces[I].OutputOff = Builder.add(Sec->getData(I));		Sec->Pieces[I].OutputOff = Builder.add(Sec->getData(I));

Builder.finalizeInOrder();		Builder.finalizeInOrder();
this->Size = Builder.getSize();		this->Size = Builder.getSize();
}		}

		template <class ELFT>
		void MergeOutputSection<ELFT>::finalizeConcurrent(size_t NumPieces) {
		// The concurrent hash table does not support resizing,
		silvas: Won't this end up smaller for 32-bit hosts?
		// so if it becomes full, redo with a larger table.
		// Our initial estimation is that the table contains 14 duplicate
		// entries for an item on average.
		size_t Estimate = std::max<size_t>(NumPieces / 15, 1024);

		for (;;) {
		ConcurrentBuilder.reset(
		new ConcurrentStringTableBuilder(Estimate, StringAlignment));

		// Approximate number of inserted items.
		std::atomic<size_t> NumInserted = {0};

		// Insert strings to the table.
		parallel_for_each(
		Sections.begin(), Sections.end(), [&](MergeInputSection<ELFT> *Sec) {
		if (ConcurrentBuilder->isFull())
		return;
		NumInserted.fetch_add(Sec->Pieces.size());
		for (size_t I = 0, E = Sec->Pieces.size(); I != E; ++I)
		if (Sec->Pieces[I].Live)
		ConcurrentBuilder->insert(Sec->Pieces[I], Sec->getData(I));
		});

		ConcurrentBuilder->finalize();
		if (!ConcurrentBuilder->isFull())
		break;

		// If the table became full, we need to redo. We know how many
		// strings were inserted before the table got full, so we can
		// make an estimate based on that number.
		double Factor = (double)NumInserted.load() / NumPieces;
		Estimate = Estimate * (1 / Factor);
		}
		this->Size = ConcurrentBuilder->size();
		}

template <class ELFT> void MergeOutputSection<ELFT>::finalize() {		template <class ELFT> void MergeOutputSection<ELFT>::finalize() {
if (shouldTailMerge())		// If -O2 is specified, we do tail merging.
		if (shouldTailMerge()) {
finalizeTailMerge();		finalizeTailMerge();
		return;
		}

		// If -no-threads is specified, we can't use the concurrent map.
		if (!Config->Threads) {
		finalizeDefault();
		silvas: This must be unreachable, or it must somehow signal this so that the calling code can retry the string table building with a bigger size. We can't have LLD fail to link because a user's object files don't have enough duplicate strings.
		return;
		}

		// If we have a very large number of mergeable strings, use the
		// concurrent string table builder. Otherwise, use the single-
		// threaded one. There's an overhead of using the concurrent one,
		// so we don't want to use that unconditionally. The threshold
		// is currently set to 100,000.

		size_t NumPieces = 0;
		for (MergeInputSection<ELFT> *Sec : Sections)
		NumPieces += Sec->Pieces.size();

		// This is for unit test.
		if (StringRef(getenv("LLD_USE_CONCURRENT_STRING_TABLE_BUILDER")) == "1") {
		silvas: Where is the accompanying test? Also, please make the name more specific.
		finalizeConcurrent(NumPieces);
		return;
		}

		if (NumPieces < 100000)
		finalizeDefault();
else		else
finalizeNoTailMerge();		finalizeConcurrent(NumPieces);
}		}

template <class ELFT>		template <class ELFT>
static typename ELFT::uint getOutFlags(InputSectionBase<ELFT> *S) {		static typename ELFT::uint getOutFlags(InputSectionBase<ELFT> *S) {
return S->Flags & ~SHF_GROUP & ~SHF_COMPRESSED;		return S->Flags & ~SHF_GROUP & ~SHF_COMPRESSED;
}		}

template <class ELFT>		template <class ELFT>
static SectionKey<ELFT::Is64Bits> createKey(InputSectionBase<ELFT> *C,
if (isa<MergeInputSection<ELFT>>(C) ||		if (isa<MergeInputSection<ELFT>>(C) ||
(Config->Relocatable && (C->Flags & SHF_MERGE)))		(Config->Relocatable && (C->Flags & SHF_MERGE)))
Alignment = std::max<uintX_t>(C->Alignment, C->Entsize);		Alignment = std::max<uintX_t>(C->Alignment, C->Entsize);

return SectionKey<ELFT::Is64Bits>{OutsecName, C->Type, Flags, Alignment};		return SectionKey<ELFT::Is64Bits>{OutsecName, C->Type, Flags, Alignment};
}		}

template <class ELFT>		template <class ELFT>
std::pair<OutputSectionBase *, bool>		std::pair<OutputSectionBase *, bool>
		silvas: I'm very concerned that there may be user scenarios where we end up almost always needing multiple trips through this loop (e.g., maybe "most" non-debug builds end up with only 2 duplicates on average?). How will we determine if this is the case? Otherwise, this "optimization" may backfire and users in the wild will get slower links; we need to have some feedback loop to correct this if it is the case, or be really confident that this optimization will always speed things up. I would propose that we do a run of Poudriere or Debian or Gentoo with this patch applied and treat the resize case as a fatal error (and ideally the error message would also give the number of resizes). That way we can get an idea of how common this is.
		ruiu (author): I added code here to get a better estimate. Now we keep track of an approximate number N of successfully inserted items. If N < NumPieces, only a fraction N / NumPieces of the pieces was inserted, so we need to enlarge the table by the inverse, namely NumPieces / N.
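
The re-estimation described in the reply above can be sketched as a standalone function. This is an illustration under stated assumptions, not the patch's code: `nextEstimate` is a hypothetical name, and the two guards (treating a zero count as one, and at least doubling the table) are additions of this sketch to make sure every retry grows the table even when the approximate count is close to NumPieces.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: given the table size that overflowed, the approximate
// number of pieces counted before it filled up, and the total number of
// pieces, returns the table size to try next.
std::size_t nextEstimate(std::size_t Estimate, std::size_t NumInserted,
                         std::size_t NumPieces) {
  assert(Estimate > 0);
  // Assumption of this sketch: avoid dividing by zero.
  NumInserted = std::max<std::size_t>(NumInserted, 1);
  // Only NumInserted / NumPieces of the input fit, so scale by the inverse.
  double Scale = static_cast<double>(NumPieces) / NumInserted;
  std::size_t Next = static_cast<std::size_t>(Estimate * Scale);
  // Assumption of this sketch: always grow, by at least a factor of two.
  return std::max(Next, Estimate * 2);
}
```

For example, if a table of 1024 buckets filled up after roughly half of 1000 pieces had been counted, the next attempt would use 2048 buckets.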
OutputSectionFactory<ELFT>::create(InputSectionBase<ELFT> *C,		OutputSectionFactory<ELFT>::create(InputSectionBase<ELFT> *C,
StringRef OutsecName) {		StringRef OutsecName) {
SectionKey<ELFT::Is64Bits> Key = createKey(C, OutsecName);		SectionKey<ELFT::Is64Bits> Key = createKey(C, OutsecName);
return create(Key, C);		return create(Key, C);
}		}

template <class ELFT>		template <class ELFT>
std::pair<OutputSectionBase *, bool>		std::pair<OutputSectionBase *, bool>

test/ELF/merge-strings-concurrent.s

This file was added.

				// REQUIRES: x86
				// RUN: llvm-mc -filetype=obj -triple=x86_64-pc-linux %s -o %t.o
				// RUN: env LLD_USE_CONCURRENT_STRING_TABLE_BUILDER=1 ld.lld %t.o -o %t.so -shared
				// RUN: llvm-readobj -s -section-data -t %t.so | FileCheck %s

				.section .rodata.str1.1, "aMS", @progbits,1
				.asciz "abc"
				.asciz "def"
				.asciz "ghijklmn"
				.asciz "o"
				.asciz "pqrstuvwxyz"

				.section .rodata.str2.2, "aMS", @progbits,1
				.asciz "ABC"
				.asciz "def"
				.asciz "ghijklmn"
				.asciz "O"
				.asciz "pqrstuvwxyz"

				// CHECK: Name: .rodata (1)
				// CHECK: SectionData (
				// CHECK-NEXT: 0000: 41424300 64656600 61626300 6768696A |ABC.def.abc.ghij|
				// CHECK-NEXT: 0010: 6B6C6D6E 006F004F 00707172 73747576 |klmn.o.O.pqrstuv|
				// CHECK-NEXT: 0020: 7778797A 00 |wxyz.|
				// CHECK-NEXT: )