This is an archive of the discontinued LLVM Phabricator instance.


Differential D115993

[ELF] Optimize RelocationSection<ELFT>::writeTo
ClosedPublic

Authored by MaskRay on Dec 18 2021, 11:25 AM.


Details

Reviewers

arichardson
ikudrin
peter.smith

Commits

rG6683099a0d0a: [ELF] Optimize RelocationSection<ELFT>::writeTo

Summary

When linking a 1.2G output (nearly no debug info, 2846621 dynamic relocations) using --threads=8, I measured

9.131462 Total ExecuteLinker
1.449913 Total Write output file
1.445784 Total Write sections
0.657152 Write sections {"detail":".rela.dyn"}

This change decreases the .rela.dyn time to 0.25s, leading to a 4% speedup in the total time.

The parallelSort is slow because of expensive r_sym/r_offset computation. Cache the values.
The iteration is slow. Move r_sym/r_addend computation ahead of time and parallelize it.

With the change, the new encodeDynamicReloc is cheap (0.05s). So don't parallelize it.
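The caching idea in the summary can be illustrated outside lld. The following is a minimal sketch, not the patch itself: `Reloc`, `kRelative`, `inputOffset`, `symId` and `sortRelocs` are hypothetical stand-ins, and the two getters only imitate the costly lookups. It computes each sort key once in a separate pass and then sorts on plain fields.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical stand-in for lld's DynamicReloc; only the fields needed here.
struct Reloc {
  uint32_t type = 0;
  uint64_t inputOffset = 0; // hypothetical input used to derive r_offset
  uint32_t symId = 0;       // hypothetical input used to derive r_sym
  uint64_t r_offset = 0;    // cached by computeRaw()
  uint32_t r_sym = 0;       // cached by computeRaw()

  // Stand-ins for the expensive lookups done by the real getters.
  uint64_t getOffset() const { return 0x1000 + inputOffset; }
  uint32_t getSymIndex() const { return symId; }

  void computeRaw() {
    r_offset = getOffset();
    r_sym = getSymIndex();
  }
};

constexpr uint32_t kRelative = 8; // hypothetical R_*_RELATIVE type value

void sortRelocs(std::vector<Reloc> &relocs) {
  // Pass 1: compute every key exactly once. The patch runs this pass in
  // parallel; a serial loop keeps the sketch dependency-free.
  for (Reloc &r : relocs)
    r.computeRaw();
  // Pass 2: the comparator reads cached fields only.
  std::sort(relocs.begin(), relocs.end(), [](const Reloc &a, const Reloc &b) {
    return std::make_tuple(a.type != kRelative, a.r_sym, a.r_offset) <
           std::make_tuple(b.type != kRelative, b.r_sym, b.r_offset);
  });
}
```

A comparison sort calls its comparator O(n log n) times, so moving the lookups out of it reduces them to one per relocation.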

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

MaskRay created this revision. Dec 18 2021, 11:25 AM

Herald added a subscriber: emaste. Dec 18 2021, 11:25 AM

MaskRay requested review of this revision. Dec 18 2021, 11:25 AM

Herald added a project: Restricted Project. Dec 18 2021, 11:25 AM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B139987: Diff 395302. Dec 18 2021, 11:36 AM

I have tried a serial radix sort whose performance is similar to the parallel qsort.
It has more code, has an allocation, and needs to sacrifice the D62141 ordering property (the property doesn't matter and can be sacrificed if justified):

// Sort by (!IsRelative,SymIndex). DT_REL[A]COUNT requires us to
// place R_*_RELATIVE first. SymIndex is to improve locality.
if (sort) {
  unsigned num = symTab->getNumSymbols();
  SmallVector<unsigned, 0> cnt(num);
  for (const DynamicReloc &rel : relocs)
    ++cnt[rel.r_sym];
  for (unsigned j = 0, i = 0; i != num; ++i) {
    unsigned t = j + cnt[i];
    cnt[i] = j;
    j = t;
  }
  auto tmp = relocs;
  unsigned j = 0, k = relocs.size();
  for (const DynamicReloc &rel : relocs)
    tmp[cnt[rel.r_sym]++] = rel;

  j = 0;
  for (const DynamicReloc &rel : tmp)
    if (rel.type == target->relativeRel)
      relocs[j++] = rel;
    else
      relocs[--k] = rel;
  std::reverse(relocs.begin() + k, relocs.end());
}
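For readers who want to run the idea, here is a self-contained sketch of the same counting-sort approach. `Reloc` and `countingSortRelocs` are hypothetical names, and the function assumes every `r_sym` is below `numSymbols`. Ties keep their input order, so entries are not ordered by `r_offset`; that is the ordering property given up.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Reloc { // hypothetical stand-in for DynamicReloc
  uint32_t type;
  uint32_t r_sym;
  uint64_t r_offset;
};

// Sorts by (!isRelative, r_sym); ties keep their input order.
// Precondition (assumed): r.r_sym < numSymbols for every relocation.
void countingSortRelocs(std::vector<Reloc> &relocs, unsigned numSymbols,
                        uint32_t relativeType) {
  // Histogram of symbol indexes, turned into start offsets by an
  // exclusive prefix sum.
  std::vector<unsigned> start(numSymbols, 0);
  for (const Reloc &r : relocs)
    ++start[r.r_sym];
  unsigned sum = 0;
  for (unsigned &s : start) {
    unsigned count = s;
    s = sum;
    sum += count;
  }
  // Stable scatter by r_sym into a temporary (the extra allocation).
  std::vector<Reloc> bySym(relocs.size());
  for (const Reloc &r : relocs)
    bySym[start[r.r_sym]++] = r;
  // Stable partition: relative relocations fill the front, the rest fill
  // the back in reverse and are flipped afterwards.
  std::size_t front = 0, back = relocs.size();
  for (const Reloc &r : bySym) {
    if (r.type == relativeType)
      relocs[front++] = r;
    else
      relocs[--back] = r;
  }
  std::reverse(relocs.begin() + back, relocs.end());
}
```

Calling `countingSortRelocs(v, 3, 8)` on relocations with symbol indexes in the range 0 to 2 places all type-8 entries first, then the remaining entries grouped by symbol index.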

No objections from me. It may be worth getting some figures from the lld speed test (https://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20171030/498843.html) as well, although I expect they won't show regressions.

Will be out of office for the next couple of weeks so may be a bit slow to respond. I'm happy for others to comment/approve.

lld/ELF/SyntheticSections.h
480	Just thinking about making the enum size uint8_t. Not entirely sure it will make a lot of difference in this case, but you may want to reorder some of the fields so that they minimise padding between items, for example addend before r_sym. Kind may also benefit from being last.

lld/ELF/SyntheticSections.cpp
1684–1686	Preferably, that should be a method in the `DynamicReloc` class.
lld/ELF/SyntheticSections.h
480	`RelType` is `uint32_t`, so there is no padding. Moving `Kind` to the end (as well as making it `uint8_t`) changes nothing, as the whole class should be 64-bit aligned and thus is 64 bytes long regardless of the changes.
486	I do not personally like optimizations where the meaning of a field changes between phases. That makes the code fragile and hard to support. At least, when `addend` is updated, `kind` should be adjusted, or else the object's state becomes inconsistent.

address comments

MaskRay marked an inline comment as done. Dec 20 2021, 7:29 PM

MaskRay added inline comments.

lld/ELF/SyntheticSections.h
480	Dropped the change to `Kind`. Moved `type` before `r_sym` because r_offset/r_sym/type/addend layout is more similar to `Elf64_Rela`.
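The padding discussion can be checked with a small layout probe. This is a sketch under stated assumptions: `RelocLayout` is a hypothetical mirror of the field order in the final diff (sym, outputSec, inputSec, offsetInSec, r_offset, type, r_sym, addend, kind, expr), pointers are replaced by `void *`, and both enums are given an explicit 32-bit underlying type, which is what unscoped enums typically get on mainstream ABIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 32-bit stand-ins for lld's Kind and RelExpr enums.
enum Kind : uint32_t { AddendOnly, AgainstSymbol };
enum RelExpr : uint32_t { R_ADDEND };

// Hypothetical mirror of the patched DynamicReloc member order.
struct RelocLayout {
  void *sym;
  const void *outputSec;
  const void *inputSec;
  uint64_t offsetInSec;
  uint64_t r_offset;
  uint32_t type; // RelType is uint32_t
  uint32_t r_sym;
  int64_t addend;
  Kind kind;
  RelExpr expr;
};

// type and r_sym pack into one 8-byte slot, so addend needs no padding.
static_assert(offsetof(RelocLayout, r_sym) == offsetof(RelocLayout, type) + 4,
              "type and r_sym are adjacent");
// Only checked where pointers are 8 bytes (LP64-style targets).
static_assert(sizeof(void *) != 8 || sizeof(RelocLayout) == 64,
              "expected 64 bytes with 8-byte pointers");
```

Under these assumptions the struct is 64 bytes with 8-byte pointers, the same figure quoted earlier in the thread, and shrinking `kind` to one byte would only turn the saved bytes into tail padding.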

Harbormaster completed remote builds in B140180: Diff 395578. Dec 20 2021, 7:45 PM

ikudrin added inline comments. Dec 21 2021, 8:38 AM

lld/ELF/SyntheticSections.cpp
1668	And now we have an inconsistency if `kind` was `AddendOnlyWithTargetVA` or `AgainstSymbolWithTargetVA` because `sym` is not `nullptr` and an `assert()` in `computeAddend()` can trigger. Maybe add a new kind `PostComputeRaw` and add `assert`s to `computeAddend()` and `needsDynSymIndex()` to ensure that they are not called after `computeRaw()`?

MaskRay added inline comments. Dec 21 2021, 9:12 AM

lld/ELF/SyntheticSections.cpp
1668	Every redundant operation costs. We have comprehensive tests, so I do not worry much about problems caused by inconsistency. Adding `PostComputeRaw` seems to sacrifice too much for the check.

LGTM

This revision is now accepted and ready to land. Dec 21 2021, 9:33 AM

Thanks:)

Closed by commit rG6683099a0d0a: [ELF] Optimize RelocationSection<ELFT>::writeTo (authored by MaskRay). Dec 21 2021, 9:44 AM

This revision was automatically updated to reflect the committed changes.

MaskRay added a commit: rG6683099a0d0a: [ELF] Optimize RelocationSection<ELFT>::writeTo.

This looks like a nice improvement to me. I also like @ikudrin's suggestion of validating internal consistency in debug builds, but if this has a measurable performance impact the current patch LGTM.

lld/ELF/SyntheticSections.cpp
1668	Not sure if that makes any difference in a release build since we already have the store in computeRaw(). One thing that might also make it more resilient to bugs introduced while refactoring (or to help modified downstream consumers with less comprehensive test coverage) would be to keep the `addend` field private and add an `r_addend()` accessor that asserts that we are in the `PostComputeRaw`/`ContentsFinalized`/`SomeOtherName` state.
1775–1779	Does it make sense to also parallelize this, or is this not meaningful for the relocation counts encountered in practice?
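The accessor suggested in the comment on line 1668 could look roughly like the following. This is a hypothetical sketch, not code from lld: `PhasedReloc`, `finalized` and `isFinalized()` are illustrative names, and `computeRaw()` here takes a made-up `symbolVA` argument in place of the real addend computation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch; names are illustrative and not taken from lld.
class PhasedReloc {
public:
  explicit PhasedReloc(int64_t inputAddend) : addend(inputAddend) {}

  // Stand-in for DynamicReloc::computeRaw(): replaces the input addend
  // with the output addend and records that the phase has changed.
  void computeRaw(int64_t symbolVA) {
    assert(!finalized && "computeRaw() called twice");
    addend += symbolVA;
    finalized = true;
  }

  // Only meaningful after computeRaw(); the assert compiles away when
  // NDEBUG is defined.
  int64_t r_addend() const {
    assert(finalized && "r_addend() read before computeRaw()");
    return addend;
  }

  bool isFinalized() const { return finalized; }

private:
  int64_t addend;
  bool finalized = false;
};
```

The flag still costs a store per relocation in release builds, which is the overhead objected to above; a variant could presumably keep the flag only in assert-enabled builds.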

Revision Contents

Path                             Size
lld/ELF/OutputSections.cpp       2 lines
lld/ELF/SyntheticSections.h      26 lines
lld/ELF/SyntheticSections.cpp    41 lines

Diff 395711

lld/ELF/OutputSections.cpp

[554 lines not shown]
@@ parallelForEachN(0, sections.size(), [&](size_t i) {
     // When linking with -r or --emit-relocs we might also call this function
     // for input .rel[a].<sec> sections which we simply pass through to the
     // output. We skip over those and only look at the synthetic relocation
     // sections created during linking.
     const auto *sec = dyn_cast<RelocationBaseSection>(sections[i]);
     if (!sec)
       return;
     for (const DynamicReloc &rel : sec->relocs) {
-      int64_t addend = rel.computeAddend();
+      int64_t addend = rel.addend;
       const OutputSection *relOsec = rel.inputSec->getOutputSection();
       assert(relOsec != nullptr && "missing output section for relocation");
       const uint8_t *relocTarget =
           bufStart + relOsec->offset + rel.inputSec->getOffset(rel.offsetInSec);
       // For SHT_NOBITS the written addend is always zero.
       int64_t writtenAddend =
           relOsec->type == SHT_NOBITS
               ? 0
[27 lines not shown]

lld/ELF/SyntheticSections.h

[443 lines not shown]
@@ enum Kind {
     /// This is used by the MIPS multi-GOT implementation. It relocates
     /// addresses of 64kb pages that lie inside the output section.
     MipsMultiGotPage,
   };
   /// This constructor records a relocation against a symbol.
   DynamicReloc(RelType type, const InputSectionBase *inputSec,
                uint64_t offsetInSec, Kind kind, Symbol &sym, int64_t addend,
                RelExpr expr)
-      : type(type), sym(&sym), inputSec(inputSec), offsetInSec(offsetInSec),
-        kind(kind), expr(expr), addend(addend) {}
+      : sym(&sym), inputSec(inputSec), offsetInSec(offsetInSec), type(type),
+        addend(addend), kind(kind), expr(expr) {}
   /// This constructor records a relative relocation with no symbol.
   DynamicReloc(RelType type, const InputSectionBase *inputSec,
                uint64_t offsetInSec, int64_t addend = 0)
-      : type(type), sym(nullptr), inputSec(inputSec), offsetInSec(offsetInSec),
-        kind(AddendOnly), expr(R_ADDEND), addend(addend) {}
+      : sym(nullptr), inputSec(inputSec), offsetInSec(offsetInSec), type(type),
+        addend(addend), kind(AddendOnly), expr(R_ADDEND) {}
   /// This constructor records dynamic relocation settings used by the MIPS
   /// multi-GOT implementation.
   DynamicReloc(RelType type, const InputSectionBase *inputSec,
                uint64_t offsetInSec, const OutputSection *outputSec,
                int64_t addend)
-      : type(type), sym(nullptr), inputSec(inputSec), offsetInSec(offsetInSec),
-        kind(MipsMultiGotPage), expr(R_ADDEND), addend(addend),
-        outputSec(outputSec) {}
+      : sym(nullptr), outputSec(outputSec), inputSec(inputSec),
+        offsetInSec(offsetInSec), type(type), addend(addend),
+        kind(MipsMultiGotPage), expr(R_ADDEND) {}

   uint64_t getOffset() const;
   uint32_t getSymIndex(SymbolTableBaseSection *symTab) const;
   bool needsDynSymIndex() const {
     return kind == AgainstSymbol || kind == AgainstSymbolWithTargetVA;
   }

   /// Computes the addend of the dynamic relocation. Note that this is not the
   /// same as the #addend member variable as it may also include the symbol
   /// address/the address of the corresponding GOT entry/etc.
   int64_t computeAddend() const;

-  RelType type;
+  void computeRaw(SymbolTableBaseSection *symtab);
+
   Symbol *sym;
+  const OutputSection *outputSec = nullptr;
   const InputSectionBase *inputSec;
   uint64_t offsetInSec;
+  uint64_t r_offset;
+  RelType type;
+  uint32_t r_sym;
+  // Initially input addend, then the output addend after
+  // RelocationSection<ELFT>::writeTo.
+  int64_t addend;

 private:
   Kind kind;
   // The kind of expression used to calculate the added (required e.g. for
   // relative GOT relocations).
   RelExpr expr;
-  int64_t addend;
-  const OutputSection *outputSec = nullptr;
 };

 template <class ELFT> class DynamicSection final : public SyntheticSection {
   LLVM_ELF_IMPORT_TYPES_ELFT(ELFT)

 public:
   DynamicSection();
   void finalizeContents() override;
[762 lines not shown]

lld/ELF/SyntheticSections.cpp

[1,647 lines not shown]
 }

 RelrBaseSection::RelrBaseSection()
     : SyntheticSection(SHF_ALLOC,
                        config->useAndroidRelrTags ? SHT_ANDROID_RELR : SHT_RELR,
                        config->wordsize, ".relr.dyn") {}

 template <class ELFT>
-static void encodeDynamicReloc(SymbolTableBaseSection *symTab,
-                               typename ELFT::Rela *p,
+static void encodeDynamicReloc(typename ELFT::Rela *p,
                                const DynamicReloc &rel) {
+  p->r_offset = rel.r_offset;
+  p->setSymbolAndType(rel.r_sym, rel.type, config->isMips64EL);
   if (config->isRela)
-    p->r_addend = rel.computeAddend();
-  p->r_offset = rel.getOffset();
-  p->setSymbolAndType(rel.getSymIndex(symTab), rel.type, config->isMips64EL);
+    p->r_addend = rel.addend;
+}
+
+void DynamicReloc::computeRaw(SymbolTableBaseSection *symtab) {
+  r_offset = getOffset();
+  r_sym = getSymIndex(symtab);
+  addend = computeAddend();
+  kind = AddendOnly; // Catch errors
 }

 template <class ELFT>
 RelocationSection<ELFT>::RelocationSection(StringRef name, bool sort)
     : RelocationBaseSection(name, config->isRela ? SHT_RELA : SHT_REL,
                             config->isRela ? DT_RELA : DT_REL,
                             config->isRela ? DT_RELASZ : DT_RELSZ),
       sort(sort) {
   this->entsize = config->isRela ? sizeof(Elf_Rela) : sizeof(Elf_Rel);
 }

 template <class ELFT> void RelocationSection<ELFT>::writeTo(uint8_t *buf) {
   SymbolTableBaseSection *symTab = getPartition().dynSymTab;

+  parallelForEach(relocs,
+                  [symTab](DynamicReloc &rel) { rel.computeRaw(symTab); });
   // Sort by (!IsRelative,SymIndex,r_offset). DT_REL[A]COUNT requires us to
   // place R_*_RELATIVE first. SymIndex is to improve locality, while r_offset
   // is to make results easier to read.
-  if (sort)
-    parallelSort(
-        relocs, [&](const DynamicReloc &a, const DynamicReloc &b) {
-          return std::make_tuple(a.type != target->relativeRel,
-                                 a.getSymIndex(symTab), a.getOffset()) <
-                 std::make_tuple(b.type != target->relativeRel,
-                                 b.getSymIndex(symTab), b.getOffset());
-        });
+  if (sort) {
+    const RelType relativeRel = target->relativeRel;
+    parallelSort(relocs, [&](const DynamicReloc &a, const DynamicReloc &b) {
+      return std::make_tuple(a.type != relativeRel, a.r_sym, a.r_offset) <
+             std::make_tuple(b.type != relativeRel, b.r_sym, b.r_offset);
+    });
+  }

   for (const DynamicReloc &rel : relocs) {
-    encodeDynamicReloc<ELFT>(symTab, reinterpret_cast<Elf_Rela *>(buf), rel);
+    encodeDynamicReloc<ELFT>(reinterpret_cast<Elf_Rela *>(buf), rel);
     buf += config->isRela ? sizeof(Elf_Rela) : sizeof(Elf_Rel);
   }
 }

 template <class ELFT>
 AndroidPackedRelocationSection<ELFT>::AndroidPackedRelocationSection(
     StringRef name)
     : RelocationBaseSection(
[61 lines not shown]
@@ bool AndroidPackedRelocationSection<ELFT>::updateAllocSize() {
   // perform the initial adjustment).
   add(relocs.size());
   add(0);

   std::vector<Elf_Rela> relatives, nonRelatives;

   for (const DynamicReloc &rel : relocs) {
     Elf_Rela r;
-    encodeDynamicReloc<ELFT>(getPartition().dynSymTab, &r, rel);
+    r.r_offset = rel.getOffset();
+    r.setSymbolAndType(rel.getSymIndex(getPartition().dynSymTab), rel.type,
+                       false);
+    if (config->isRela)
+      r.r_addend = rel.computeAddend();

     if (r.getType(config->isMips64EL) == target->relativeRel)
       relatives.push_back(r);
     else
       nonRelatives.push_back(r);
   }

   llvm::sort(relatives, [](const Elf_Rel &a, const Elf_Rel &b) {
[2,114 lines not shown]