This is an archive of the discontinued LLVM Phabricator instance.

[ELF] Optimize RelocationSection<ELFT>::writeTo

Authored by MaskRay on Dec 18 2021, 11:25 AM.



When linking a 1.2G output (nearly no debug info, 2846621 dynamic relocations) using --threads=8, I measured

9.131462 Total ExecuteLinker
1.449913 Total Write output file
1.445784 Total Write sections
0.657152 Write sections {"detail":".rela.dyn"}

This change decreases the .rela.dyn write time to 0.25s, a 4% speedup in the total link time.

  • The parallelSort is slow because of expensive r_sym/r_offset computation. Cache the values.
  • The iteration is slow. Move r_sym/r_addend computation ahead of time and parallelize it.

With the change, the new encodeDynamicReloc is cheap (0.05s). So don't parallelize it.
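The key-caching idea in the first bullet can be sketched as follows. This is a minimal stand-alone illustration, not lld's actual code: the `Reloc` struct and the packed 64-bit key are hypothetical stand-ins. Instead of recomputing r_sym/r_offset inside the comparator on every comparison, each entry's key is computed once up front (the step lld parallelizes), and the sort then compares only cached integers.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a dynamic relocation entry.
struct Reloc {
  uint32_t sym;
  uint64_t offset;
};

std::vector<Reloc> sortByCachedKeys(const std::vector<Reloc> &relocs) {
  // Precompute each entry's sort key once; in lld this pass is parallel.
  // Packing (sym, offset) into one integer is a simplification that
  // assumes offsets fit in 32 bits.
  std::vector<std::pair<uint64_t, size_t>> keys(relocs.size());
  for (size_t i = 0; i != relocs.size(); ++i)
    keys[i] = {(uint64_t(relocs[i].sym) << 32) | relocs[i].offset, i};
  // The comparator now only compares cached integers.
  std::sort(keys.begin(), keys.end());
  std::vector<Reloc> out;
  out.reserve(relocs.size());
  for (const auto &ki : keys)
    out.push_back(relocs[ki.second]);
  return out;
}
```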

Diff Detail

Event Timeline

MaskRay created this revision.Dec 18 2021, 11:25 AM
MaskRay requested review of this revision.Dec 18 2021, 11:25 AM
Herald added a project: Restricted Project. Dec 18 2021, 11:25 AM
MaskRay added a comment.EditedDec 18 2021, 6:57 PM

I have tried a serial radix sort; its performance is similar to the parallel sort.
It requires more code and an extra allocation, and it sacrifices the D62141 ordering property (the property doesn't matter and could be given up if justified):

// Sort by (!IsRelative,SymIndex). DT_REL[A]COUNT requires us to
// place R_*_RELATIVE first. SymIndex is to improve locality.
if (sort) {
  unsigned num = symTab->getNumSymbols();
  SmallVector<unsigned, 0> cnt(num);
  // Count relocations per symbol index.
  for (const DynamicReloc &rel : relocs)
    ++cnt[rel.r_sym];
  // Exclusive prefix sum: cnt[i] becomes the start offset for symbol i.
  for (unsigned j = 0, i = 0; i != num; ++i) {
    unsigned t = j + cnt[i];
    cnt[i] = j;
    j = t;
  }
  // Counting sort by symbol index into a temporary buffer.
  auto tmp = relocs;
  for (const DynamicReloc &rel : relocs)
    tmp[cnt[rel.r_sym]++] = rel;
  // Partition: R_*_RELATIVE first; the rest are filled from the back and
  // reversed, preserving the symbol order within the second group.
  unsigned j = 0, k = relocs.size();
  for (const DynamicReloc &rel : tmp)
    if (rel.type == target->relativeRel)
      relocs[j++] = rel;
    else
      relocs[--k] = rel;
  std::reverse(relocs.begin() + k, relocs.end());
}
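As a self-contained illustration of the same counting-sort-plus-partition idea, here is a compilable sketch. The names and types (`Rel`, `isRelative`, `sortRelocs`) are hypothetical stand-ins for lld's internals, with `isRelative` replacing the `type == target->relativeRel` check:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Rel {
  uint32_t sym;    // symbol index (the counting-sort key)
  bool isRelative; // stands in for `type == target->relativeRel`
};

// Counting sort by symbol index, then a partition that places "relative"
// relocations first, as DT_REL[A]COUNT requires.
void sortRelocs(std::vector<Rel> &relocs, uint32_t numSyms) {
  std::vector<uint32_t> cnt(numSyms, 0);
  for (const Rel &r : relocs)
    ++cnt[r.sym];
  // Exclusive prefix sum: cnt[i] becomes the first slot for symbol i.
  for (uint32_t j = 0, i = 0; i != numSyms; ++i) {
    uint32_t t = j + cnt[i];
    cnt[i] = j;
    j = t;
  }
  std::vector<Rel> tmp(relocs.size());
  for (const Rel &r : relocs)
    tmp[cnt[r.sym]++] = r;
  // Relative relocations go to the front; the rest are appended from the
  // back and then reversed, preserving symbol order within each group.
  size_t j = 0, k = relocs.size();
  for (const Rel &r : tmp)
    if (r.isRelative)
      relocs[j++] = r;
    else
      relocs[--k] = r;
  std::reverse(relocs.begin() + k, relocs.end());
}
```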

No objections from me. It may be worth getting some figures from the lld speed test as well, although I expect they won't show regressions.

Will be out of office for the next couple of weeks so may be a bit slow to respond. I'm happy for others to comment/approve.


Just thinking about making the enum size uint8_t: not entirely sure it will make a lot of difference in this case, but you may want to reorder some of the fields so that they minimise padding between items. For example, addend before r_sym. Kind may also benefit from being last.


Preferably, that should be a method in the DynamicReloc class.


RelType is uint32_t, so there is no padding. Moving Kind to the end (as well as making it uint8_t) changes nothing, as the whole class must be 64-bit aligned and is thus 64 bytes long regardless of the changes.
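The alignment point can be demonstrated with a layout sketch. The field names below are hypothetical, not lld's actual class: once a 64-bit member forces 8-byte alignment, the compiler pads the struct to a multiple of 8, so moving a small field does not change the overall size.

```cpp
#include <cstdint>

// kind at the end: 8 + 8 + 4 + 4 + 1, padded up to a multiple of 8.
struct KindLast {
  uint64_t offset;
  uint64_t addend;
  uint32_t sym;
  uint32_t type;
  uint8_t kind;
};

// kind at the front: 1 + 7 bytes of padding before the uint64_t members.
struct KindFirst {
  uint8_t kind;
  uint64_t offset;
  uint64_t addend;
  uint32_t sym;
  uint32_t type;
};
```

On a typical LP64 target both layouts come out the same size, with the size rounded up to the struct's 8-byte alignment either way.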


I personally do not like optimizations where the meaning of a field changes between phases. That makes the code fragile and hard to maintain.

At least, when addend is updated, kind should be adjusted, or else the object's state becomes inconsistent.

MaskRay updated this revision to Diff 395578.Dec 20 2021, 7:29 PM
MaskRay marked an inline comment as done.

address comments

MaskRay marked an inline comment as done.Dec 20 2021, 7:29 PM
MaskRay added inline comments.

Dropped the change to Kind.

Moved type before r_sym because r_offset/r_sym/type/addend layout is more similar to Elf64_Rela.

ikudrin added inline comments.Dec 21 2021, 8:38 AM

And now we have an inconsistency if kind was AddendOnlyWithTargetVA or AgainstSymbolWithTargetVA because sym is not nullptr and an assert() in computeAddend() can trigger. Maybe add a new kind PostComputeRaw and add asserts to computeAddend() and needsDynSymIndex() to ensure that they are not called after computeRaw()?

MaskRay added inline comments.Dec 21 2021, 9:12 AM

Every redundant operation has a cost. We have comprehensive tests, so I do not worry much about problems caused by inconsistency. Adding PostComputeRaw seems to sacrifice too much for the check.

ikudrin accepted this revision.Dec 21 2021, 9:33 AM


This revision is now accepted and ready to land.Dec 21 2021, 9:33 AM
This revision was automatically updated to reflect the committed changes.

This looks like a nice improvement to me. I also like @ikudrin's suggestion of validating internal consistency in debug builds, but if this has a measurable performance impact the current patch LGTM.


Not sure that makes any difference in a release build, since we already have the store in computeRaw().
One thing that might also make it more resilient to bugs introduced while refactoring (or help modified downstream consumers with less comprehensive test coverage) would be to keep the addend field private and add an r_addend() accessor that asserts that we are in the PostComputeRaw/ContentsFinalized/SomeOtherName state.
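The suggested accessor pattern might look like the following. This is a sketch only; the class and member names are hypothetical stand-ins, and in a release build (with NDEBUG defined) the assert compiles away, so the accessor costs nothing.

```cpp
#include <cassert>
#include <cstdint>

// Sketch: keep the raw addend private and assert that it is only read
// after the finalization step has run.
class DynReloc {
public:
  // Stands in for computeRaw(): stores the final addend and records
  // that the object has reached its finalized state.
  void computeRaw(int64_t value) {
    addend = value;
    finalized = true;
  }

  // Accessor that catches reads of a not-yet-finalized addend in
  // assertion-enabled builds.
  int64_t r_addend() const {
    assert(finalized && "r_addend() called before computeRaw()");
    return addend;
  }

private:
  int64_t addend = 0;
  bool finalized = false;
};
```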


Does it make sense to also parallelize this, or is that not meaningful for the relocation counts encountered in practice?