This is an archive of the discontinued LLVM Phabricator instance.

[ELF] - Implemented --section-ordering-file option.
Abandoned, Public

Authored by grimar on Oct 19 2016, 4:44 AM.

Details

Summary

Gold supports the following option:

--section-ordering-file FILENAME

Layout sections in the order specified.

It uses an ordering file to lay out input sections in the given order.
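
As a rough illustration of what reading such a file could look like, here is a minimal sketch. It assumes the ordering file lists one section name per line; parseOrderingFile is a hypothetical helper, not a function from this patch or from gold, and gold's real format may accept more than plain names.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper (not from the patch): maps each section name listed in
// the ordering file to its 0-based position. Assumes one name per line.
// Blank lines are skipped and the first occurrence of a name wins.
std::map<std::string, int> parseOrderingFile(const std::string &Contents) {
  std::map<std::string, int> Rank;
  std::istringstream In(Contents);
  std::string Line;
  int Next = 0;
  while (std::getline(In, Line)) {
    // Drop trailing whitespace, including a '\r' from CRLF line endings.
    while (!Line.empty() && (Line.back() == '\r' || Line.back() == ' ' ||
                             Line.back() == '\t'))
      Line.pop_back();
    if (Line.empty())
      continue;
    // emplace() leaves an existing entry untouched, so duplicates keep
    // their first rank and do not consume a new one.
    if (Rank.emplace(Line, Next).second)
      ++Next;
  }
  return Rank;
}
```

The resulting map gives each listed section a rank that a linker could use as a sort key when laying out input sections.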

Original gold commit:
https://sourceware.org/ml/binutils/2010-06/txt00000.txt

Uses in the wild that I found:
https://glandium.org/blog/?p=2467
https://reviews.facebook.net/D22713

Event Timeline

grimar updated this revision to Diff 75128. Oct 19 2016, 4:44 AM
grimar retitled this revision to [ELF] - Implemented --section-ordering-file option.
grimar updated this object.
grimar added reviewers: ruiu, rafael, davide.
grimar updated this object.
grimar added subscribers: grimar, llvm-commits, evgeny777.
grimar updated this revision to Diff 75134. Oct 19 2016, 6:18 AM
  • Corrected naming of things, minor cleanups.
grimar updated this revision to Diff 75141. Oct 19 2016, 6:50 AM
  • Implemented support of combination with linkerscript.
grimar updated this object. Oct 19 2016, 8:50 AM
ruiu edited edge metadata. Oct 19 2016, 10:46 AM

Is this effort coordinated with Davide?

davide edited edge metadata. Oct 19 2016, 11:00 AM
In D25766#574368, @ruiu wrote:

Is this effort coordinated with Davide?

Yes! Thanks George for taking care of this.
BTW, I think the hard part of this feature is trying to evaluate and understand if it's worth it.
You may want to do some benchmarking or write a SystemTap/DTrace script to collect data, feed it into the linker, and make sure this has some real impact. I recommend using bare metal for testing rather than a VM, as the results might be unreliable.
Rui, Rafael, what do you think?

After today's investigation I found the following more or less known ways to produce such a file.

  • The first is to use valgrind with the icegrind plugin:

https://blog.mozilla.org/tglek/2010/04/07/icegrind-valgrind-plugin-for-optimizing-cold-startup/
https://glandium.org/blog/?p=1008
The method looks a bit outdated though.

As far as I understand, along with the job it does, it should also produce a final_layout.txt file where sections are dumped in the correct order,
and we should be able to reuse that file for LLD's needs.

I am not sure how easy it would be to reuse it for other applications.

I am going to check the gcc way tomorrow; it seems to me to be the easiest possible path.

Short update: I was able to use gcc to extract section ordering files. To start, I generated an order list for a clang launch without any arguments.
I used gcc from svn://gcc.gnu.org/svn/gcc/branches/google/gcc-4_9.
To do this, I performed the following steps:

  1. Build llvm with:

-DCMAKE_BUILD_TYPE=Release -DLLVM_PARALLEL_COMPILE_JOBS=8 -DLLVM_ENABLE_THREADS=true -DCMAKE_CXX_FLAGS="-fPIC -std=c++11 -ffunction-sections -fdata-sections -fprofile-generate=/home/umb/LLVM/gcda" -DCMAKE_C_FLAGS="-fPIC -ffunction-sections -fdata-sections -fprofile-generate=/home/umb/LLVM/gcda" -DCMAKE_C_COMPILER=/home/umb/gcc49google/bin/gcc -DCMAKE_CXX_COMPILER=/home/umb/gcc49google/bin/g++

  2. Run the binary in the way we are interested in. In my case it was ./clang without arguments (.gcda files should be produced after that).
  3. Now rebuild llvm with -fprofile-use, using the .gcda files obtained earlier. Use the gold linker.

-DCMAKE_BUILD_TYPE=Release -DLLVM_PARALLEL_COMPILE_JOBS=8 -DLLVM_ENABLE_THREADS=true -DCMAKE_CXX_FLAGS="-B/usr/local/bin -fPIC -std=c++11 -ffunction-sections -fdata-sections -fprofile-dir=/home/umb/LLVM/gcda -fprofile-use -freorder-functions=callgraph -Wl,--plugin-opt,file=/home/umb/LLVM/order.txt" -DCMAKE_C_FLAGS="-B/usr/local/bin -fPIC -ffunction-sections -fdata-sections -fprofile-dir=/home/umb/LLVM/gcda -fprofile-use -freorder-functions=callgraph -Wl,--plugin-opt,file=/home/umb/LLVM/order.txt" -DCMAKE_C_COMPILER=/home/umb/gcc49google/bin/gcc -DCMAKE_CXX_COMPILER=/home/umb/gcc49google/bin/g++

  4. Remove order.txt and relink clang; that will create an order.txt containing the section order for its executable.

I shared it here: https://drive.google.com/file/d/0B_OWr6ld9gUmd1hnV1k4Z25NQVk/view?usp=sharing
It is 18 megabytes in size.

The order list file contains the following sections:
.text.unlikely.*
.text.exit.*
.text.startup.*
.text.hot.*

BTW, both gold and ld recognize such sections by name and use the names as a hint for sorting (regardless of whether an order file was used).
I did not check how it works, but that is what I saw in the bfd/gold source code.
Do we want to implement the same for compatibility?

Finally, now that I have such a file, I am going to write a tool or script to prepare it for use as a linker input ordering file (it looks like it contains excessive data, I'll check) and run some tests.
The first one will be to estimate clang launch time with and without sections ordered. I am thinking about other possible ones, maybe compiling a simple hello world?

I did a lot of testing over the last few days and unfortunately I still can't find a proper way to generate a "good" order list of sections
to demonstrate a positive result.
But I can prove that the ordering of sections definitely matters.

When I run an lld-linked clang under perf, with this patch applied and an empty order file:
perf stat ./clang-4.0 -help

Performance counter stats for './clang-4.0 -help':
         60.445699      task-clock (msec)         #    0.860 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               889      page-faults               #    0.015 M/sec                  
...

Now if I change the ordering function to return a random value:

int elf::getSectionFileOrder(StringRef S) {
    return rand() % INT32_MAX;
}
I get:
Performance counter stats for './clang-4.0 -help':
         26.831371      task-clock (msec)         #    0.742 CPUs utilized          
                 2      context-switches          #    0.075 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
             1,963      page-faults               #    0.073 M/sec                  
....

So I observed a major slowdown, just because of reordering sections. I think that can probably serve as proof that building
a proper ordering file can, in theory, help speed up application startup.

Unfortunately I was unable to generate an ordering file that works better than an empty one or none at all. I tried the following method:

  1. Generated a list of pagefaults: perf trace -F all -o test.txt ./clang-4.0 -help
  2. Applied a filter: grep -E '(clang-4.0/57467 minfault)' test.txt. That produces a list of faults like the following:
0.092 ( 0.000 ms): clang-4.0/57467 minfault [dl_main+0x54d] => /home/umb/LLVM/build_self/bin/clang-4.0@0x40 (d.)
0.097 ( 0.000 ms): clang-4.0/57467 minfault [dl_main+0x6b1] => /home/umb/LLVM/build_self/bin/clang-4.0@0x881df50 (d.)
0.102 ( 0.000 ms): clang-4.0/57467 minfault [_dl_setup_hash+0x10] => /home/umb/LLVM/build_self/bin/clang-4.0@0x23455a8 (d.)
0.120 ( 0.000 ms): clang-4.0/57467 minfault [strlen+0x26] => /home/umb/LLVM/build_self/bin/clang-4.0@0x2a6ebcf (d.)
0.143 ( 0.000 ms): clang-4.0/57467 minfault [dl_main+0x1aab] => /home/umb/LLVM/build_self/bin/clang-4.0@0x881e128 (d.)
0.153 ( 0.000 ms): clang-4.0/57467 minfault [strchr+0x23] => /home/umb/LLVM/build_self/bin/clang-4.0@0x23f5363 (d.)
  3. Generated a list of clang binary symbols: readelf -W -s clang-4.0 > symbols.txt
  4. Using a self-written tool, generated an order list of sections (it takes the offset of each page fault, finds the matching symbol name, and just attaches prefixes like ".text." etc). I posted the result just for reference: https://justpaste.it/zrbs
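
The tool itself was not posted, so the following is only a sketch of how the mapping in the last step might be done. The Symbol struct, findSymbol and buildOrder are hypothetical names. It assumes the symbol table is already parsed and sorted by address, that fault offsets and symbol addresses share one address space, and that only a ".text." prefix is attached.

```cpp
#include <algorithm>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Hypothetical record for one symbol table entry (e.g. parsed from readelf).
struct Symbol {
  uint64_t Addr;
  uint64_t Size;
  std::string Name;
};

// Returns the name of the symbol whose [Addr, Addr + Size) range contains
// Off, or an empty string if there is none. Syms must be sorted by Addr.
std::string findSymbol(const std::vector<Symbol> &Syms, uint64_t Off) {
  // First symbol that starts strictly after Off.
  auto It = std::upper_bound(
      Syms.begin(), Syms.end(), Off,
      [](uint64_t V, const Symbol &S) { return V < S.Addr; });
  if (It == Syms.begin())
    return "";
  --It; // Last symbol that starts at or before Off.
  return Off < It->Addr + It->Size ? It->Name : "";
}

// Builds the order list in first-fault order. Each symbol is emitted once
// with a ".text." prefix; faults that hit no symbol are ignored.
std::vector<std::string> buildOrder(const std::vector<Symbol> &Syms,
                                    const std::vector<uint64_t> &Faults) {
  std::vector<std::string> Order;
  std::set<std::string> Seen;
  for (uint64_t Off : Faults) {
    std::string Name = findSymbol(Syms, Off);
    if (!Name.empty() && Seen.insert(Name).second)
      Order.push_back(".text." + Name);
  }
  return Order;
}
```

Emitting symbols in first-fault order is one plausible choice here; the original tool may have ordered or prefixed them differently.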

I checked that sorting really works, but the result was worse than without an ordering file. That was partly because
getSectionFileOrder() returns 0 by default, which I think is wrong, so I made it return INT32_MAX (to place unlisted sections after listed ones).
After that change there is no difference between using and not using the ordering file. Even the page fault counts look almost the same.
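
The effect of that default can be sketched as follows. This is not the patch's code: sectionOrder and sortSections are hypothetical stand-ins, and sections are represented by their names only.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for getSectionFileOrder(): returns the rank from the
// ordering file, or INT32_MAX for sections that are not listed there.
int sectionOrder(const std::map<std::string, int> &Rank,
                 const std::string &Name) {
  auto It = Rank.find(Name);
  return It == Rank.end() ? INT32_MAX : It->second;
}

// Sorts sections by rank. std::stable_sort keeps unlisted sections (which
// all compare equal) in their original relative order, after listed ones.
void sortSections(std::vector<std::string> &Sections,
                  const std::map<std::string, int> &Rank) {
  std::stable_sort(Sections.begin(), Sections.end(),
                   [&](const std::string &A, const std::string &B) {
                     return sectionOrder(Rank, A) < sectionOrder(Rank, B);
                   });
}
```

With a default of 0 instead, every unlisted section would tie with the first listed one and end up ahead of the remaining listed sections, which matches the worse result described above.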

The timing I posted for the first case looks like a mistake; the raw average results are:

18 - 23 msec and about 900 page faults for an empty file, and
25 - 30 msec and about 1,960 page faults for a random section order.

So about 2x more page faults and about 1.3x more time on average.

When I run an lld-linked clang under perf, with this patch applied and an empty order file:
perf stat ./clang-4.0 -help

I'm very skeptical this is a decent testcase, at least alone. ./clang --help takes no time to start up, so even if you show there's a speed-up, it's very hard to tell whether it's a real improvement or just noise.
I recommend trying a class of applications where load time is a dominant factor, e.g. browsers. I would claim that Firefox or Chromium are decent candidates. YMMV.

D26130 has landed. I guess we do not need this one anymore?

grimar abandoned this revision. Nov 11 2016, 12:22 AM

D26130 landed as an alternative solution.