This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
lld/trunk/ELF/
-
trunk/
-
ELF/
-
InputSection.h
-
InputSection.cpp
-
OutputSections.h
-
OutputSections.cpp
-
Writer.cpp

Differential D20645

Avoid doing binary search.
ClosedPublic

Authored by ruiu on May 25 2016, 2:11 PM.

Download Raw Diff

Details

Reviewers

silvas
• rafael

Commits

rG406b469de445: Avoid doing binary search.
rLLD270999: Avoid doing binary search.
rL270999: Avoid doing binary search.

Summary

MergedInputSection::getOffset is the busiest function in LLD if string
merging is enabled and input files have lots of mergeable sections.
It is usually the case when creating executable with debug info,
so it is pretty common.

The reason why it is slow is because it has to do faily complex
computations. For non-mergeable sections, section contents are
contiguous in output, so in order to compute an output offset,
we only have to add the output section's base address to an input
offset. But for mergeable strings, section contents are split for
merging, so they are not contigous. We've got to do some lookups.

We used to do binary search on the list of section pieces.
It is slow because I think it's hostile to branch prediction.

This patch replaces it with hash table lookup. Seems it's working
pretty well. Below is "perf stat -r10" output when linking clang
with debug info. In this case this patch speeds up about 5%.

Before:

    6721.936004 task-clock (msec)         #    1.001 CPUs utilized            ( +-  0.11% )
            234 context-switches          #    0.035 K/sec                    ( +-  5.15% )
              0 cpu-migrations            #    0.000 K/sec                    ( +- 50.92% )
      1,075,866 page-faults               #    0.160 M/sec                    ( +-  0.10% )
 18,754,243,967 cycles                    #    2.790 GHz                      ( +-  0.11% )
 10,081,501,978 stalled-cycles-frontend   #   53.76% frontend cycles idle     ( +-  0.19% )
<not supported> stalled-cycles-backend
 21,185,939,161 instructions              #    1.13  insns per cycle
                                          #    0.48  stalled cycles per insn  ( +-  0.02% )
  3,809,479,053 branches                  #  566.723 M/sec                    ( +-  0.01% )
    133,385,955 branch-misses             #    3.50% of all branches          ( +-  0.00% )

    6.717561895 seconds time elapsed                                          ( +-  0.11% )

After:

    6386.323956 task-clock (msec)         #    1.000 CPUs utilized            ( +-  0.23% )
            353 context-switches          #    0.055 K/sec                    ( +- 18.85% )
              0 cpu-migrations            #    0.000 K/sec                    ( +- 55.28% )
      1,289,073 page-faults               #    0.202 M/sec                    ( +-  0.11% )
 17,817,334,814 cycles                    #    2.790 GHz                      ( +-  0.23% )
 10,395,036,549 stalled-cycles-frontend   #   58.34% frontend cycles idle     ( +-  0.37% )
<not supported> stalled-cycles-backend
 18,856,073,120 instructions              #    1.06  insns per cycle
                                          #    0.55  stalled cycles per insn  ( +-  0.06% )
  3,282,465,949 branches                  #  513.984 M/sec                    ( +-  0.06% )
     72,314,845 branch-misses             #    2.20% of all branches          ( +-  0.01% )

    6.384843079 seconds time elapsed                                          ( +-  0.24% )

Diff Detail

Repository: rL LLVM

Event Timeline

ruiu updated this revision to Diff 58510.May 25 2016, 2:11 PM

ruiu retitled this revision from to Avoid doing binary search..

ruiu updated this object.

ruiu added reviewers: • rafael, silvas.

ruiu added a subscriber: llvm-commits.

Redo benchmark with "taskset -c 0" to pin down the core to get more accurate reuslts.

ruiu updated this object.May 25 2016, 2:29 PM

IMO this is not worth doing. It's only 5% when this function and surrounding code is like 25%, so this doesn't really speed things up much at all.

Also, just bailing out on keeping this performance benefit in multithreaded mode seems like a poor engineering decision.

ELF/InputSection.cpp
533 ↗	(On Diff #58514)	Have you analyzed the pattern of this `Offset` value? I suspect there is some sort of pattern that is easier to cache than using a whole map (e.g. the values might be sequential). (as a first-order analysis, just add a printf here and see how well gzip compresses it)
540 ↗	(On Diff #58514)	Did you actually measure this? The actual tradeoffs for binary search vs a hash table are fairly complex and mostly about reducing serial dependencies through memory. http://www.pvk.ca/Blog/2012/07/03/binary-search-star-eliminates-star-branch-mispredictions/ http://www.pvk.ca/Blog/2012/07/30/binary-search-is-a-pathological-case-for-caches/ http://arxiv.org/abs/1509.05053
547 ↗	(On Diff #58514)	This is sort of counterproductive...

This revision now requires changes to proceed.May 25 2016, 3:05 PM

I was surprised you said 5% overall improvement isn't worth it. The number seems really significant to me. It is not a microbenchmark but is a real result of linking clang. Just +41 lines improves the overall link time by 5%? To me, it is between "excellent" and "it's too good to be true so I'll apply this and verify it myself."

You can see the drop of number of branch mispredictions in the perf numbers.

In D20645#440099, @ruiu wrote:

I was surprised you said 5% overall improvement isn't worth it. The number seems really significant to me. It is not a microbenchmark

It is. It is basically a microbenchmark of an ad-hoc std::upper_bound-based associative container. This patch adds a hash table in front and makes this be only 20% of the total link time instead of 25% on one data set that you looked at. It's a band-aid and adds complexity which makes it more difficult to actually optimize the linker long-term.

Also, it punts on threads, so it doesn't stack with any gains we get there (hence not a long-term solution).

Have you tried just completely eliminating this code? E.g. creating separate "input sections" for each of the elements in the mergeable input section? Think about it like this: this code is currently 25% of the link time. Are there other solutions that are less than 25% of the link time? I think it is reasonable to expect to find a solution that only costs us 5%.

but is a real result of linking clang. Just +41 lines improves the overall link time by 5%? To me, it is between "excellent" and "it's too good to be true so I'll apply this and verify it myself."

You can see the drop of number of branch mispredictions in the perf numbers.

This kind of thinking will get you stuck in a local optimum. Think of the future and how much faster LLD will be than the current LLD. Is this patch part of that future? How much faster do you want LLD to be? How much faster can it be?

Note: on windows by default (and also trivially on tmpfs; relevant to google's internal build system) LLD does not have to wait on writing to a disk. LLD has a long way to go until it is even heavily bottlenecked by disk IO; let alone being limited by DRAM bandwidth (e.g. 50GB/s on a typical low-end machine).

I tried out this patch on a build of one of our games with debug info (output binary is around 300M; getOffset is called around 11.5M times), and I can't reproduce the significant performance difference you are seeing with this patch: http://reviews.llvm.org/F1986081 (vertical axis is run time, horizontal axis is run number)

This was on a mac. I think the extra memory cost of the OffsetMap might be hurting (and it may be worse on Windows since the virtual memory system there is even poorer usually). OffsetMap adds 2 * sizeof(uintX_t) * #pieces * (1/hashTableLoad) total memory, which for ELF64 is over 16 bytes per piece; for some string tables this densemap may be on the order of the size of the entire section.
On the profile, I see MergeInputSection::getOffset go from 20.5% to 10.4% with this patch (good), but I now see a new entry for MergeInputSection::finalizePieces at 10.3% of total link time, counteracting that. MergeInputSection::finalizePieces is composed of 4.3% (of total link time) in DenseMap's InsertIntoBucketImpl and 3.4% is DenseMap::grow (the rest is 2.6% self-time of finalizePieces).

As a point of comparison, consider this simple patch: http://reviews.llvm.org/P4309
{P4309}
Compared to the control and your current patch (blue and yellow), this patch (green) gives a pretty substantial speedup on this test case with this compiler (apple clang) on this os (mac): http://reviews.llvm.org/F1986091

Zoomed in view, normalized to median of "a": http://reviews.llvm.org/F1986092

Just the CachedLastPieceIndex + 1 case gets about 5% speedup on the test case I was looking at. Adding the CachedLastPieceIndex case brings it up to about 8%.
Obviously this approach is thread-hostile and probably not worth implementing, but hopefully it shows a couple things:

When so much time is spent on so few lines of code, even relatively trivial changes can cause large performance swings.
Performance improvements on one test case/platform/compiler (I mean the compiler used to build LLD itself) may be neutral (or even pessimizations) on other test cases/platforms/compilers.

To be sure I reran those measurements again, this time with the calls to the 3 different versions interleaved (reference, this patch ("rui"), and the silly alternative I posted ("alt")) which confirms the results: http://reviews.llvm.org/F1986202

Since I have a profiler open again (this time OSX Instruments), here are some more notes that might be useful:

For this testcase, about 200k of the 11.5M calls to getOffset are for fixed-size (non-SHF_STRINGS) sections where we can directly index instead of do binary search. If we wanted we could do a special case check and speed up those cases we could do that (but I don't think that would have a huge effect on the overall performance for this kind of test case so I didn't bother trying).

Also, one thing to note is that if we already have a sorted list of relocations referencing the merge section, then resolving their offset to their piece index can be done almost for free in InputSection.cpp:splitStrings since we are linearly scanning it anyway. InputSection.cpp:splitStrings is actually 11% of the profile so avoiding it, optimizing it, or at least doing other useful work (e.g. hashing all the strings or resolving offsets to pieces) while we do that is probably worth it. About 4.4% of total runtime seems to be in std::vector's __emplace_back_slow_path as called by InputSection.cpp:splitStrings suggesting that we may want to look into tuning our memory allocation.

We seem to spend about 15% of the link as self-time in InputSectionBase<ELFT>::relocateNonAlloc which can probably be dramatically reduced by manually pipelining the loop (or adding some prefetches to mitigate the serial dependencies through memory there). Also hoisting out some hot cases from the various switches and appropriate __builtin_expect will likely help a bit too.
There is about 5% self time in SymbolBody::getVA (it is about 25% of the total time, but 20% of this is MergeInputSection::getOffset) which may be the serial dependence through memory in the expression SC->OutSec->getVA(); there also might be some dispatching overhead through some switches and could probably benefit from __builtin_expect or similar (I haven't looked in detail).

We spend about 12% in memcpy/memmove when writing to the output, likely bottlenecked on the kernel virtual memory subsystem (besides servicing the page fault, the kernel has to zerofill every page we touch there; also the kernel has locking contention while servicing the page fault, explaining the poor scalability with threading). This is 332ms time which for a 300MB output works out to 0.9GB/s, at least an order of magnitude below DRAM bandwidth (which memcpy/memmove should easily max out), so by tuning that (e.g. ensuring we are using 2MB pages when available, spawning threads early on to start faulting it all in, maybe use pwrite instead of mmap, etc.) that cost can be reduced to almost nothing. Note that in an -O0 link the output is 421MB and 418ms in memcpy/memmove, which is about 1GB/s (i.e. not maxing out DRAM); but in the -O0 link about 30% of time is spent in memcpy/memmove for this testcase which means that the speedup for -O0 to be expected from tuning this is dramatic.
Mac's VM subsystem is pretty bad though; I expect Linux to do better here (e.g. automatically use 2MB pages, etc.). Worth measuring.

Overall, merge sections seem to account for about 50% of the total runtime in this test case. I see on the profile:
MergeInputSection::getOffset - 20.5%
MergeOutputSection::addSection - 16.8%
MergeInputSection::splitIntoPieces - 11.1%
At -O0, we manage to recover most of this of course (we are about 40% faster on this test case based on a quick measurement).

One other interesting thing I measured (by just adding a printf): about 35% of the calls to MergeInputSection::getOffset are duplicated. I.e. 35% of all calls to getOffset on a particular MergeInputSection are for offsets that have already had getOffset called on them. (this is easy to test by printing a line for each call to MergeInputSection::getOffset with the pair ((uintptr_t)this,Offset) and then sort|uniq them to see how many are actually distinct).

The idea for caching the last piece index came from a quick observation of the pattern of offsets passed to MergeInputSection::getOffset for each MergeInputSection. The access pattern is pretty structured (and somewhat pretty; this only shows the pattern for 100 MergeInputSection's (out of over 7000)): http://reviews.llvm.org/F1986332

Drilling down I found that the differences between successive piece indexes was 1 (i.e. sequential) in 50% of cases, and 0 in 7% of cases http://reviews.llvm.org/F1986337.

Besides those, there is a very long tail and adding more cases (besides CachedLastPieceIndex + 1 and CachedLastPieceIndex) didn't speed things up in my empirical testing. Also, using CachedLastPieceIndex as an initial "seed" to partition the binary search in the case of "cache miss" didn't help much in my measurements (in fact, it actually hurt; likely because it harmed locality (i.e. highly different sequences of elements were searched each time instead of always starting at the beginning and end of the pieces array, testing the same middle every time, etc.)).

Also, at least for this test case, there is an interesting distribution of calls to MergeInputSection::getOffset made per input section: http://reviews.llvm.org/F1986325

The area under the histogram is the total number of calls to MergeInputSection::getOffset. The horizontal axis is the number of calls to MergeInputSection::getOffset for a given MergeInputSection (e.g. there is an outlier where we called MergeInputSection::getOffset 30,000 times on the MergeInputSection). The vertical axis is the total number of calls to MergeInputSection::getOffset within each bucket of the histogram.
The thing that surprised me is that sections with a smaller number of calls to getOffset made into them become increasingly abundant so that the graph is relatively flat (varying only 2-3x) up to about 15,000 on the horizontal axis.

This looks good to me. If anyone can make it even faster awesome, send a patch. For now we keep the fastest that has been implemented.

The results I got comparing two lto builds were

chromium

master 4.360323783
patch  4.387683184 1.00627462601

chromium fast

master 1.677643336
patch  1.71745014 1.023727811

the gold plugin

master 0.369605994
patch  0.3585041 0.969962895136

clang

master 0.630896046
patch  0.608380056 0.964311093495

llvm-as

master 0.035529869
patch  0.034659865 0.975513447573

the gold plugin fsds

master 0.390681335
patch  0.379471413 0.971306737753

clang fsds

master 0.702253693
patch  0.678229592 0.965789996921

llvm-as fsds

master 0.031787751
patch  0.031102319 0.978437228856

scylla

master 4.194291385
patch  3.377171089 0.80518275413

Thank you for doing this. These numbers are interesting, although some
numbers seem to be different from what I saw. At least the number I saw for
this patch is positive.

Anyways, even though there might be room to improve it further, this patch
should be fine. It doesn't hurt other improvements. If there's a better
way, we can simply do it (that would probably be replacing the code in this
patch, but that's fine). I don't see any reason why this change is
particularly controversial. I'd say that the patch before this one (which
shrinks the size of SectionPiece) would be more controversial as it is
microscopic optimization.

Not directly related to this patch, and I'm not suggesting we should do it, but here's one radical idea that could make the mergeable section handling a lot faster: instead of storing mergeable strings into sections, make them symbols and handle symbol names as symbol contents. For example, for a mergeable string "foobar", a compiler create a symbol ".merge.foobar". When a linker see a symbol that starts with ".merge." prefix (or with a special symbol type), it creates section contents just like as it reserves space in .bss for common symbols by copying the symbol name to the section, and set the symbol's address to the address in the section.

The symbol table is the single central repository to identify and uniquify pieces of data and mergeable strings can piggyback on it. I think this way is a lot faster than the current scheme to merge strings using mergeable sections.

In D20645#441351, @ruiu wrote:

Not directly related to this patch, and I'm not suggesting we should do it, but here's one radical idea that could make the mergeable section handling a lot faster: instead of storing mergeable strings into sections, make them symbols and handle symbol names as symbol contents. For example, for a mergeable string "foobar", a compiler create a symbol ".merge.foobar". When a linker see a symbol that starts with ".merge." prefix (or with a special symbol type), it creates section contents just like as it reserves space in .bss for common symbols by copying the symbol name to the section, and set the symbol's address to the address in the section.

Yes, this is exactly what I mentioned above about creating separate "input sections" for each mergeable object. I was thinking about doing it at the section level because that seems more natural for ELF (the compiler already does this to some extent in -ffunction-sections -fdata-sections), but I guess it might make sense to do it at symbol level as a sort of "atom model".

These are the timings on windows: http://reviews.llvm.org/F1987812

Looks like this patch (labeled "rui") gives some speedup. I'm still a bit worried about the extra memory usage (on principle of not being a memory hog), but since this doesn't seem to pessimize anything then this seems fine.

A couple nits inline, but this LGTM.

ELF/InputSection.cpp
539 ↗	(On Diff #58514)	These branches may be less predictable than typical branches, but I think it is premature (based on what I've seen presented as analysis) to specifically say that the slowdown is due to branch misprediction. For example, the fact that changing the sizeof in r270717 made a significant performance difference suggests that there is a strong influence of memory access too and that branch misprediction isn't the only factor. (see e.g. the things I linked earlier; the performance of binary search is actually quite complex and, broadly speaking, the performance difference between binary search and hash tables is due to memory, not branch prediction). So I would prefer to remove this comment.
548 ↗	(On Diff #58514)	I'd really like to avoid this `if (!Config->Threads)` thing. For example, this is a confounding factor if somebody wants to analyze LLD's scalability with threads, so I'd rather we not have it unless it makes a huge difference (which IIRC you said it didn't).

This revision is now accepted and ready to land.May 26 2016, 9:29 PM

Closed by commit rL270999: Avoid doing binary search. (authored by ruiu). · Explain WhyMay 27 2016, 7:45 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path

Size

lld/

trunk/

ELF/

3 lines

40 lines

3 lines

6 lines

5 lines

Diff 58783

lld/trunk/ELF/InputSection.h

Show First 20 Lines • Show All 139 Lines • ▼ Show 20 Lines	public:

// Mark the piece at a given offset live. Used by GC.		// Mark the piece at a given offset live. Used by GC.
void markLiveAt(uintX_t Offset) { LiveOffsets.insert(Offset); }		void markLiveAt(uintX_t Offset) { LiveOffsets.insert(Offset); }

// Translate an offset in the input section to an offset		// Translate an offset in the input section to an offset
// in the output section.		// in the output section.
uintX_t getOffset(uintX_t Offset);		uintX_t getOffset(uintX_t Offset);

		void finalizePieces();

private:		private:
		llvm::DenseMap<uintX_t, uintX_t> OffsetMap;
llvm::DenseSet<uintX_t> LiveOffsets;		llvm::DenseSet<uintX_t> LiveOffsets;
};		};

// This corresponds to a .eh_frame section of an input file.		// This corresponds to a .eh_frame section of an input file.
template <class ELFT> class EhInputSection : public SplitInputSection<ELFT> {		template <class ELFT> class EhInputSection : public SplitInputSection<ELFT> {
public:		public:
typedef typename ELFT::Shdr Elf_Shdr;		typedef typename ELFT::Shdr Elf_Shdr;
typedef typename ELFT::uint uintX_t;		typedef typename ELFT::uint uintX_t;
▲ Show 20 Lines • Show All 99 Lines • Show Last 20 Lines

lld/trunk/ELF/InputSection.cpp

Show First 20 Lines • Show All 507 Lines • ▼ Show 20 Lines	for (uintX_t Off : LiveOffsets)
this->getSectionPiece(Off)->Live = true;		this->getSectionPiece(Off)->Live = true;
}		}

template <class ELFT>		template <class ELFT>
bool MergeInputSection<ELFT>::classof(const InputSectionBase<ELFT> *S) {		bool MergeInputSection<ELFT>::classof(const InputSectionBase<ELFT> *S) {
return S->SectionKind == InputSectionBase<ELFT>::Merge;		return S->SectionKind == InputSectionBase<ELFT>::Merge;
}		}

		// Do binary search to get a section piece at a given input offset.
template <class ELFT>		template <class ELFT>
SectionPiece *SplitInputSection<ELFT>::getSectionPiece(uintX_t Offset) {		SectionPiece *SplitInputSection<ELFT>::getSectionPiece(uintX_t Offset) {
ArrayRef<uint8_t> D = this->getSectionData();		ArrayRef<uint8_t> D = this->getSectionData();
StringRef Data((const char *)D.data(), D.size());		StringRef Data((const char *)D.data(), D.size());
uintX_t Size = Data.size();		uintX_t Size = Data.size();
if (Offset >= Size)		if (Offset >= Size)
fatal("entry is past the end of the section");		fatal("entry is past the end of the section");

// Find the element this offset points to.		// Find the element this offset points to.
auto I = std::upper_bound(		auto I = std::upper_bound(
Pieces.begin(), Pieces.end(), Offset,		Pieces.begin(), Pieces.end(), Offset,
[](const uintX_t &A, const SectionPiece &B) { return A < B.InputOff; });		[](const uintX_t &A, const SectionPiece &B) { return A < B.InputOff; });
--I;		--I;
return &*I;		return &*I;
}		}

		// Returns the offset in an output section for a given input offset.
		// Because contents of a mergeable section is not contiguous in output,
		// it is not just an addition to a base output offset.
template <class ELFT>		template <class ELFT>
typename ELFT::uint MergeInputSection<ELFT>::getOffset(uintX_t Offset) {		typename ELFT::uint MergeInputSection<ELFT>::getOffset(uintX_t Offset) {
		auto It = OffsetMap.find(Offset);
		if (It != OffsetMap.end())
		return It->second;

		// If Offset is not at beginning of a section piece, it is not in the map.
		// In that case we need to search from the original section piece vector.
SectionPiece &Piece = *this->getSectionPiece(Offset);		SectionPiece &Piece = *this->getSectionPiece(Offset);
assert(Piece.Live);		assert(Piece.Live);

// Compute the Addend and if the Base is cached, return.
uintX_t Addend = Offset - Piece.InputOff;		uintX_t Addend = Offset - Piece.InputOff;
if (Piece.OutputOff != size_t(-1))		uintX_t Ret = Piece.OutputOff + Addend;
return Piece.OutputOff + Addend;		return Ret;
		}

// Map the base to the offset in the output section and cache it.		// Create a map from input offsets to output offsets for all section pieces.
ArrayRef<uint8_t> D = this->getSectionData();		// It is called after finalize().
StringRef Data((const char *)D.data(), D.size());		template <class ELFT> void MergeInputSection<ELFT>::finalizePieces() {
StringRef Entry = Data.substr(Piece.InputOff, Piece.size());		OffsetMap.grow(this->Pieces.size());
auto MOS = static_cast<MergeOutputSection<ELFT> >(this->OutSec);		for (SectionPiece &Piece : this->Pieces) {
Piece.OutputOff = MOS->getOffset(Entry);		if (!Piece.Live)
return Piece.OutputOff + Addend;		continue;
		if (Piece.OutputOff == size_t(-1)) {
		// Offsets of tail-merged strings are computed lazily.
		auto OutSec = static_cast<MergeOutputSection<ELFT> >(this->OutSec);
		ArrayRef<uint8_t> D = Piece.data();
		StringRef S((const char *)D.data(), D.size());
		Piece.OutputOff = OutSec->getOffset(S);
		}
		OffsetMap[Piece.InputOff] = Piece.OutputOff;
		}
}		}

template <class ELFT>		template <class ELFT>
MipsReginfoInputSection<ELFT>::MipsReginfoInputSection(elf::ObjectFile<ELFT> *F,		MipsReginfoInputSection<ELFT>::MipsReginfoInputSection(elf::ObjectFile<ELFT> *F,
const Elf_Shdr *Hdr)		const Elf_Shdr *Hdr)
: InputSectionBase<ELFT>(F, Hdr, InputSectionBase<ELFT>::MipsReginfo) {		: InputSectionBase<ELFT>(F, Hdr, InputSectionBase<ELFT>::MipsReginfo) {
// Initialize this->Reginfo.		// Initialize this->Reginfo.
ArrayRef<uint8_t> D = this->getSectionData();		ArrayRef<uint8_t> D = this->getSectionData();
▲ Show 20 Lines • Show All 71 Lines • Show Last 20 Lines

lld/trunk/ELF/OutputSections.h

Show First 20 Lines • Show All 87 Lines • ▼ Show 20 Lines	if (Align > Header.sh_addralign)
Header.sh_addralign = Align;		Header.sh_addralign = Align;
}		}

// If true, this section will be page aligned on disk.		// If true, this section will be page aligned on disk.
// Typically the first section of each PT_LOAD segment has this flag.		// Typically the first section of each PT_LOAD segment has this flag.
bool PageAlign = false;		bool PageAlign = false;

virtual void finalize() {}		virtual void finalize() {}
		virtual void finalizePieces() {}
virtual void		virtual void
forEachInputSection(std::function<void(InputSectionBase<ELFT> *)> F) {}		forEachInputSection(std::function<void(InputSectionBase<ELFT> *)> F) {}
virtual void writeTo(uint8_t *Buf) {}		virtual void writeTo(uint8_t *Buf) {}
virtual ~OutputSectionBase() = default;		virtual ~OutputSectionBase() = default;

protected:		protected:
StringRef Name;		StringRef Name;
Elf_Shdr Header;		Elf_Shdr Header;
▲ Show 20 Lines • Show All 211 Lines • ▼ Show 20 Lines

public:		public:
MergeOutputSection(StringRef Name, uint32_t Type, uintX_t Flags,		MergeOutputSection(StringRef Name, uint32_t Type, uintX_t Flags,
uintX_t Alignment);		uintX_t Alignment);
void addSection(InputSectionBase<ELFT> *S) override;		void addSection(InputSectionBase<ELFT> *S) override;
void writeTo(uint8_t *Buf) override;		void writeTo(uint8_t *Buf) override;
unsigned getOffset(StringRef Val);		unsigned getOffset(StringRef Val);
void finalize() override;		void finalize() override;
		void finalizePieces() override;
bool shouldTailMerge() const;		bool shouldTailMerge() const;

private:		private:
llvm::StringTableBuilder Builder;		llvm::StringTableBuilder Builder;
		std::vector<MergeInputSection<ELFT> *> Sections;
};		};

struct CieRecord {		struct CieRecord {
SectionPiece *Piece = nullptr;		SectionPiece *Piece = nullptr;
std::vector<SectionPiece *> FdePieces;		std::vector<SectionPiece *> FdePieces;
};		};

// Output section for .eh_frame.		// Output section for .eh_frame.
▲ Show 20 Lines • Show All 315 Lines • Show Last 20 Lines

lld/trunk/ELF/OutputSections.cpp

	Show First 20 Lines • Show All 1,158 Lines • ▼ Show 20 Lines
	}			}

	template <class ELFT>			template <class ELFT>
	void MergeOutputSection<ELFT>::addSection(InputSectionBase<ELFT> *C) {			void MergeOutputSection<ELFT>::addSection(InputSectionBase<ELFT> *C) {
	auto *Sec = cast<MergeInputSection<ELFT>>(C);			auto *Sec = cast<MergeInputSection<ELFT>>(C);
	Sec->OutSec = this;			Sec->OutSec = this;
	this->updateAlign(Sec->Align);			this->updateAlign(Sec->Align);
	this->Header.sh_entsize = Sec->getSectionHdr()->sh_entsize;			this->Header.sh_entsize = Sec->getSectionHdr()->sh_entsize;
				Sections.push_back(Sec);

	bool IsString = this->Header.sh_flags & SHF_STRINGS;			bool IsString = this->Header.sh_flags & SHF_STRINGS;

	for (SectionPiece &Piece : Sec->Pieces) {			for (SectionPiece &Piece : Sec->Pieces) {
	if (!Piece.Live)			if (!Piece.Live)
	continue;			continue;
	uintX_t OutputOffset = Builder.add(toStringRef(Piece.data()));			uintX_t OutputOffset = Builder.add(toStringRef(Piece.data()));
	if (!IsString \|\| !shouldTailMerge())			if (!IsString \|\| !shouldTailMerge())
	Show All 11 Lines
	}			}

	template <class ELFT> void MergeOutputSection<ELFT>::finalize() {			template <class ELFT> void MergeOutputSection<ELFT>::finalize() {
	if (shouldTailMerge())			if (shouldTailMerge())
	Builder.finalize();			Builder.finalize();
	this->Header.sh_size = Builder.getSize();			this->Header.sh_size = Builder.getSize();
	}			}

				template <class ELFT> void MergeOutputSection<ELFT>::finalizePieces() {
				for (MergeInputSection<ELFT> *Sec : Sections)
				Sec->finalizePieces();
				}

	template <class ELFT>			template <class ELFT>
	StringTableSection<ELFT>::StringTableSection(StringRef Name, bool Dynamic)			StringTableSection<ELFT>::StringTableSection(StringRef Name, bool Dynamic)
	: OutputSectionBase<ELFT>(Name, SHT_STRTAB,			: OutputSectionBase<ELFT>(Name, SHT_STRTAB,
	Dynamic ? (uintX_t)SHF_ALLOC : 0),			Dynamic ? (uintX_t)SHF_ALLOC : 0),
	Dynamic(Dynamic) {			Dynamic(Dynamic) {
	this->Header.sh_addralign = 1;			this->Header.sh_addralign = 1;
	}			}

	▲ Show 20 Lines • Show All 538 Lines • Show Last 20 Lines

lld/trunk/ELF/Writer.cpp

Show First 20 Lines • Show All 867 Lines • ▼ Show 20 Lines	template <class ELFT> void Writer<ELFT>::createSections() {
// finalized once the .dynamic finalizer has added a few last		// finalized once the .dynamic finalizer has added a few last
// strings. See DynamicSection::finalize()		// strings. See DynamicSection::finalize()
for (OutputSectionBase<ELFT> *Sec : OutputSections)		for (OutputSectionBase<ELFT> *Sec : OutputSections)
if (Sec != Out<ELFT>::DynStrTab && Sec != Out<ELFT>::Dynamic)		if (Sec != Out<ELFT>::DynStrTab && Sec != Out<ELFT>::Dynamic)
Sec->finalize();		Sec->finalize();

if (isOutputDynamic())		if (isOutputDynamic())
Out<ELFT>::Dynamic->finalize();		Out<ELFT>::Dynamic->finalize();

		// Now that all output offsets are fixed. Finalize mergeable sections
		// to fix their maps from input offsets to output offsets.
		for (OutputSectionBase<ELFT> *Sec : OutputSections)
		Sec->finalizePieces();
}		}

template <class ELFT> bool Writer<ELFT>::needsGot() {		template <class ELFT> bool Writer<ELFT>::needsGot() {
if (!Out<ELFT>::Got->empty())		if (!Out<ELFT>::Got->empty())
return true;		return true;

// We add the .got section to the result for dynamic MIPS target because		// We add the .got section to the result for dynamic MIPS target because
// its address and properties are mentioned in the .dynamic section.		// its address and properties are mentioned in the .dynamic section.
▲ Show 20 Lines • Show All 497 Lines • Show Last 20 Lines