This is an archive of the discontinued LLVM Phabricator instance.

[ELF] - Implemented -z combrelocs/nocombreloc.
ClosedPublic

Authored by grimar on Apr 26 2016, 3:30 AM.

Details

Summary

This is the option which sorts relocs to optimize dynamic linker performance.
-z combelocs is the default in gold, also it ignores -z nocombreloc,
this patch do the same.

Patch sorts relocations by symbols only and do not create any
DT_REL[A]COUNT entries. That is different with what gold/bfd do.

More information about option is here:
http://www.airs.com/blog/archives/186
http://people.redhat.com/jakub/prelink.pdf, p.2

Diff Detail

Repository
rL LLVM

Event Timeline

grimar updated this revision to Diff 54981.Apr 26 2016, 3:30 AM
grimar retitled this revision from to [ELF] - Implemented -z combrelocs/nocombreloc..
grimar updated this object.
grimar added reviewers: ruiu, rafael.
grimar added subscribers: grimar, llvm-commits.
grimar updated this object.Apr 26 2016, 8:17 AM
ruiu edited edge metadata.Apr 26 2016, 8:46 PM

There are two ways of doing this. One way is, as you did, to sort symbols before writing it to the mmap'ed output. The other way is to write relocations to the buffer and sort the buffer as an array of RelTy. I think the latter is more efficient than the former because of better locality. Did you consider doing that way?

grimar updated this revision to Diff 55187.Apr 27 2016, 4:54 AM
grimar edited edge metadata.
  • Addressed Rui's comment.
In D19528#413335, @ruiu wrote:

There are two ways of doing this. One way is, as you did, to sort symbols before writing it to the mmap'ed output. The other way is to write relocations to the buffer and sort the buffer as an array of RelTy. I think the latter is more efficient than the former because of better locality. Did you consider doing that way?

I did, but idea of sorting something that was already written did not look clean for me, I did not think
about possible perfomance win though. So, probably you're right, updated the patch.

I have found -z nocombreloc useful to debug ld and ld.so.

I have found -z nocombreloc useful to debug ld and ld.so.

This can be really one-line-easy to add in lld, if restoring the original
order of relocations can be useful for something.
But again, gold does ignore that option, so it seems usecase
is very specific.

I have found -z nocombreloc useful to debug ld and ld.so.

This can be really one-line-easy to add in lld, if restoring the original
order of relocations can be useful for something.
But again, gold does ignore that option, so it seems usecase
is very specific.

Well, gold can't build ld.so on Linux.

grimar updated this revision to Diff 55234.Apr 27 2016, 8:16 AM
grimar updated this object.
  • Added -z nocombreloc support.

I have found -z nocombreloc useful to debug ld and ld.so.

This can be really one-line-easy to add in lld, if restoring the original
order of relocations can be useful for something.
But again, gold does ignore that option, so it seems usecase
is very specific.

Well, gold can't build ld.so on Linux.

So just in case, I added support of -z nocombreloc, since it anyways trivial and short in code,

ruiu added a comment.Apr 27 2016, 11:45 AM

It is generally looking good.

ELF/Driver.cpp
354 ↗(On Diff #55234)

Let's name ZCombreloc as this is a one-word option (if it were -z no-comb-reloc, then I'd name it ZCombReloc). Maybe we want to change ZExecStack to ZExecstack too.

ELF/OutputSections.cpp
306 ↗(On Diff #55234)
SHF_ALLOC), Sort(Sort) {
323–324 ↗(On Diff #55234)

Please use uint8_t instead of unsigned char.

329 ↗(On Diff #55234)

Remove else after return.

331 ↗(On Diff #55234)

Ditto.

333–334 ↗(On Diff #55234)

I'd use X and Y (or whatever) rather than IdxA and IdxB because they have very narrow scopes.

341–342 ↗(On Diff #55234)

You don't need these temporary variables because you can use A.r_offset and B.r_offset directly.

364–365 ↗(On Diff #55234)

Let's invert this condition and add comment before if about -z combreloc option.

// Sort relocations by r_offset, breaking ties with other conditions.
// Doing this does not change the semantics nor the file format, but
// some dynamic linker can load executables/DSOs faster when the
// relocation table is sorted. The fact that we have sorted relocations 
// is recorded using DT_RELACOUNT tag.
// http://people.redhat.com/jakub/prelink.pdf
643–644 ↗(On Diff #55234)

Swap the two ifs so that we don't call getRelativeCount if the feature is disabled.

Also, does it make sense to not emit DT_RELACOUNT if there's no relative relocations?

ELF/OutputSections.h
232 ↗(On Diff #55234)

I wouldn't use default argument lightly because it is virtually defining two different functions, with and without the parameter. I prefer passing the argument explicitly.

244 ↗(On Diff #55234)

NumRelativeRels may be better.

ELF/Writer.cpp
132 ↗(On Diff #55234)

I wouldn't define this because it is used only once.

grimar updated this revision to Diff 55393.Apr 28 2016, 3:06 AM
grimar marked 10 inline comments as done.
  • Addressed review comments.
ELF/Driver.cpp
354 ↗(On Diff #55234)

Done.

ELF/OutputSections.cpp
306 ↗(On Diff #55234)

Done, but I dont sure I like how it looks after clang-format.

323–324 ↗(On Diff #55234)

Done.

329 ↗(On Diff #55234)

Done.

331 ↗(On Diff #55234)

Done.

333–334 ↗(On Diff #55234)

Done.

341–342 ↗(On Diff #55234)

Done.

364–365 ↗(On Diff #55234)

Done, but:

  1. "The fact that we have sorted relocations is recorded using DT_RELACOUNT tag." was not quite correct, I removed it. DT_RELACOUNT just keeps amount of Relative relocations at start of rela.dyn. We can have zero of such relocations and DT_RELACOUNT entry will absent. But relocations will still be sorted. That how golds works, and it seems reasonable: there is no point in zero filled tag about relative relocations, we will still have some boost after sorting others one.
  2. Changed link to Ian Lance Tailor notes about it.
643–644 ↗(On Diff #55234)

DT_RELACOUNT == 0 just wastes file space and does not affect anything I think.
It looks like excessive job for lld to generate it and for dynamic linker to proccess the
empty record.
gold do the same, btw.

ELF/OutputSections.h
232 ↗(On Diff #55234)

Done.

244 ↗(On Diff #55234)

Done.

ELF/Writer.cpp
132 ↗(On Diff #55234)

Done.

grimar added a comment.EditedApr 29 2016, 5:52 AM

Below are first benchmark result I got:

used release shared build of clang:
-DCMAKE_BUILD_TYPE=Release -DLLVM_BUILD_LLVM_DYLIB=true -DBUILD_SHARED_LIBS=true
OS: Ubuntu 16.04 LTS 64bit.
Each time I run test 5 times and selected average by total startup time result:

// 1. With this full patch:
umb@ubuntu:~/umb/LLVM/self-host-my-shared/bin$ LD_DEBUG=statistics ./clang

42269:	
42269:	runtime linker statistics:
42269:	  total startup time in dynamic loader: 102686501 cycles
42269:		    time needed for relocation: 78691443 cycles (76.6%)
42269:	                 number of relocations: 15077
42269:	      number of relocations from cache: 14659
42269:	        number of relative relocations: 136769
42269:		   time needed to load objects: 22087834 cycles (21.5%)

// 2. Without this patch:
umb@ubuntu:~/umb/LLVM/self-host-orig-shared/bin$ LD_DEBUG=statistics ./clang

42265:	
42265:	runtime linker statistics:
42265:	  total startup time in dynamic loader: 151284557 cycles
42265:		    time needed for relocation: 127451126 cycles (84.2%)
42265:	                 number of relocations: 27492
42265:	      number of relocations from cache: 2244
42265:	        number of relative relocations: 2020
42265:		   time needed to load objects: 21858213 cycles (14.4%)

Amount of startup time cycles wo/patch patch is 1.47x in compare with w/patch.
Time needed for relocations is 1.61x.

// 3. With this patch, but do not set DT_RELACOUNT tag:
umb@ubuntu:~/umb/LLVM/self-hosting-notag/bin$ LD_DEBUG=statistics ./clang

101488:	
101488:	runtime linker statistics:
101488:	  total startup time in dynamic loader: 111007425 cycles
101488:		    time needed for relocation: 87371019 cycles (78.7%)
101488:	                 number of relocations: 15077
101488:	      number of relocations from cache: 14659
101488:	        number of relative relocations: 2020
101488:		   time needed to load objects: 21857215 cycles (19.6%)

~8% more startup cycles than (1).
~11% more relocation time cycles than (1).

Now I am going to test static -pie build.

Now I am going to test static -pie build.

And I was not able to perform this test. Self-hosting with -pie configuration fails because llvm-tblgen segfaults (without my patch).
Full configuration line: -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=/home/umb/umb/LLVM/build/bin/clang -DCMAKE_CXX_COMPILER=/home/umb/umb/LLVM/build/bin/clang++ -DLLVM_PARALLEL_COMPILE_JOBS=8 -DLLVM_ENABLE_THREADS=true -DCMAKE_CXX_FLAGS="-fuse-ld=lld -pie" -DCMAKE_C_FLAGS="-fuse-ld=lld -pie"

[169/2657 0.0/sec] Building Attributes.inc...
FAILED: include/llvm/IR/Attributes.inc.tmp 
cd /home/umb/umb/LLVM/no-tag-static/include/llvm/IR && /home/umb/umb/LLVM/no-tag-static/bin/llvm-tblgen -gen-attrs -I /home/umb/umb/LLVM/llvm/include/llvm/IR -I /home/umb/umb/LLVM/llvm/lib/Target -I /home/umb/umb/LLVM/llvm/include /home/umb/umb/LLVM/llvm/include/llvm/IR/Attributes.td -o /home/umb/umb/LLVM/no-tag-static/include/llvm/IR/Attributes.inc.tmp
Segmentation fault (core dumped)

Not sure, did anybody try -pie selfhosting before ?

But anyways previous benchmark results are probably enough for now to prove that patch is useful as is. Thoughts ?

rafael edited edge metadata.May 5 2016, 8:11 AM

Please do simplify the patch to first not set any new DT_ and sort only by offset. Lets benchmark just that first.

ELF/Writer.cpp
168 ↗(On Diff #55393)

Why can't you sort this? It shouldn't hurt when being lazy and it should help otherwise.

grimar added a comment.May 5 2016, 8:16 AM

Please do simplify the patch to first not set any new DT_ and sort only by offset. Lets benchmark just that first.

Ok, will try tomorrow.

ELF/Writer.cpp
168 ↗(On Diff #55393)

Good point.

grimar added a comment.May 6 2016, 4:38 AM

Please do simplify the patch to first not set any new DT_ and sort only by offset. Lets benchmark just that first.

I did that (simplified version I used is here: D20013) and got no any perfomance boost after that at all
(or very minor what I believe is just a computing error):

Middle from 5 runs is shown.
Without patch:

umb@ubuntu:~/umb/LLVM/self-hosting/bin$ LD_DEBUG=statistics ./clang
     46693:	
     46693:	runtime linker statistics:
     46693:	  total startup time in dynamic loader: 193739180 cycles
     46693:		    time needed for relocation: 166734504 cycles (86.0%)
     46693:	                 number of relocations: 27736
     46693:	      number of relocations from cache: 2247
     46693:	        number of relative relocations: 2020
     46693:		   time needed to load objects: 24664951 cycles (12.7%)
clang-3.9: error: no input files
     46693:	
     46693:	runtime linker statistics:
     46693:	           final number of relocations: 28487
     46693:	final number of relocations from cache: 2247

With patch (no tag, sort by offset).

umb@ubuntu:~/umb/LLVM/selfhost-reduced/bin$ LD_DEBUG=statistics ./clang
     46666:	
     46666:	runtime linker statistics:
     46666:	  total startup time in dynamic loader: 188386979 cycles
     46666:		    time needed for relocation: 161439636 cycles (85.6%)
     46666:	                 number of relocations: 27736
     46666:	      number of relocations from cache: 2247
     46666:	        number of relative relocations: 2020
     46666:		   time needed to load objects: 24821139 cycles (13.1%)
clang-3.9: error: no input files
     46666:	
     46666:	runtime linker statistics:
     46666:	           final number of relocations: 28487
     46666:	final number of relocations from cache: 2247

I think results are expected as symbols are not grouped together, what
looks to be the main condition for optimization to work.

I will try running benchmarks on a few variations of this.

What version is this based on? I can't get it to apply cleanly.

I was able to manually apply it.

What I did was:

  • build clang with shared libraries.
  • time with sudo schedtool -F -p 99 -a 0x4 -e perf stat -r 10 ./bin/clang
  • build the patched version of lld
  • delete bin/clang-3.9 and lib/*.so
  • link again
  • time the result

What I got was

  • trunk: 0.059005660
  • sort on RELATIVE, Symbol, offset, type: 0.040090757
  • sort on Symbol, offset, type: 0.042367124
  • sort on offset, type: 0.059119353
  • sort on RELATIVE, Symbol, offset: 0.040131764
  • sort on RELATIVE, Symbol: 0.040396584
  • sort on RELATIVE: 0.058322444
  • sort on Symbol: 0.039886032

So it looks like the only thing that makes a difference is the symbol.

BTW, one cannot sort .rel.plt, it would be nice to document why.
I think it is better to use stable_sort to be sure we have a
deterministic output without checking fields that don't make a
difference.

So a change sorting only by symbol with stable sort LGTM. Please make
sure you are not computing the number of relative rolocations or
setting any DT_*.

The partial apply that I used for benchmarking was.

Cheers,
Rafael

grimar added a comment.May 6 2016, 4:40 PM

I was able to manually apply it.

What I did was:

  • build clang with shared libraries.
  • time with

>sudo schedtool -F -p 99 -a 0x4 -e perf stat -r 10 ./bin/clang

  • build the patched version of lld
  • delete bin/clang-3.9 and lib/*.so
  • link again
  • time the result

What I got was

  • trunk: 0.059005660
  • sort on RELATIVE, Symbol, offset, type: 0.040090757
  • sort on Symbol, offset, type: 0.042367124
  • sort on offset, type: 0.059119353
  • sort on RELATIVE, Symbol, offset: 0.040131764
  • sort on RELATIVE, Symbol: 0.040396584
  • sort on RELATIVE: 0.058322444
  • sort on Symbol: 0.039886032

So it looks like the only thing that makes a difference is the symbol.

BTW, one cannot sort .rel.plt, it would be nice to document why.
I think it is better to use stable_sort to be sure we have a
deterministic output without checking fields that don't make a
difference.

So a change sorting only by symbol with stable sort LGTM. Please make
sure you are not computing the number of relative rolocations or
setting any DT_*.

The partial apply that I used for benchmarking was.

I'll update the patch a bit later then for final review.

At the same time I got 10% difference in startup cycles
when compared with DT_* vs without: http://reviews.llvm.org/D19528#416734
Do you think that was an computing error ?

grimar updated this revision to Diff 56707.May 10 2016, 6:27 AM
grimar edited edge metadata.
  • Sort only by symbols, using stable_sort.
  • Do not create any DT_entries (I am still that that reduced startup cycles amount by 10%, so going to introduce another patch after landing this and after retesting the perfomance results I got before).

BTW, one cannot sort .rel.plt, it would be nice to document why.

I also noticed that: when I sort the .rel.plt then produced lld linked binaries are broken (segfaults on startup,
not sure all of, but at least some of them do). Not sure I have any explanation for that now.

rafael accepted this revision.May 10 2016, 6:55 AM
rafael edited edge metadata.

LGTM

This revision is now accepted and ready to land.May 10 2016, 6:55 AM
This revision was automatically updated to reflect the committed changes.
grimar updated this object.May 10 2016, 8:54 AM
grimar edited edge metadata.
  • Sort only by symbols, using stable_sort.
  • Do not create any DT_entries (I am sure that that reduced startup cycles amount by 10%, so going to introduce another patch after landing this and after retesting the perfomance results I got before).

I retested case when we are creating DT_RELACOUNT vs case when we are not (what is currently commited). Tried many times,
but it seems that difference is just a calcualtion error.
I am not sure what caused 10% difference I observed before, currently I have no reproduce for that anymore.

I tried
sudo schedtool -F -p 99 -a 0x1 -e perf stat -r 1000 ./clang
as well as just LD_DEBUG=statistics ./clang many times

Numbers are about the same in average.

Changes I tested today were: