This is an archive of the discontinued LLVM Phabricator instance.

[ELF] - Use multithreading to build .gdb_index
AbandonedPublic

Authored by grimar on May 12 2017, 3:06 AM.

Details

Reviewers
ruiu
rafael
Summary

Currently LLD uses LLVM DWARF parsers to scan objects
to find information required for building .gdb_index.

There are 2 known possible ways to speedup building index.
One of them is to use relocated output to produce .gdb_index (D31424),
that saves time because DWARF parsers don't need to apply some
amount of relocations on their side.

But way above looks does not allow to parallel generation. This
patch suggests to use parrallel_for loop when we scan input objects.
That gives significant speedup itself. And if we also land D32853 and D31136,
.gdb_index feature will be really much faster.

Numbers for this patch are below. I am using i7 4790K (4 phys cores, @4.0), ssd, 10 runs for each test.
Tried to link debug llc binary using LLD.

no --gdb-index option:
4,044980833 seconds time elapsed ( +- 0,45% )

with --gdb-index, without patch
18,688194986 seconds time elapsed ( +- 0,50% )

with --gdb-index, with patch
8,578650449 seconds time elapsed ( +- 0,32% )

That means gdb_index building time without patch is 18.688s - 4.044s = 14.664s.
Time after patch is 8.578s - 4.044s = 4.534s.
After patch we are 3.234x faster for my configuration.

Diff Detail

Event Timeline

grimar created this revision.May 12 2017, 3:06 AM
grimar edited the summary of this revision. (Show Details)May 12 2017, 3:07 AM

I'll leave most of the review to Rui et, al. Just a drive by glance.

ELF/SyntheticSections.cpp
1772–1774

Might make more sense to have one container containing a struct with 3 members, than 3 parallel containers like this?

grimar added inline comments.May 12 2017, 10:46 AM
ELF/SyntheticSections.cpp
1772–1774

May be, I just tried to do as minimum changes as possible in current structure, to make initial review proccess be easier,
also assuming that if whole idea is OK we will came to something more ideal during reviews.

So I would be happy to start from general question for all reviewers - is this whole idea looks fine ?
And then we can polish it together if so :)

ruiu edited edge metadata.May 12 2017, 11:23 AM

This looks mostly fine, but I wouldn't submit this patch at the moment, as I think the code to generate .gdb_index is not optimize enough for single-thread yet. Our general strategy is to focus on single-thread performance and then use multi-threads as a final shot. I believe this strategy is working well, because (a) using multi-threads may hide real problems if used too early that are obvious when run with only one thread (which would result in making the linker a so-so performance), and (b) thinking about multi-threading is distracting when optimizing code.

Do you think you can't make it any faster without using multi-threading? I instinct is that the performance we can reach without multi-threading is higher than it is now.

In D33122#753528, @ruiu wrote:

This looks mostly fine, but I wouldn't submit this patch at the moment, as I think the code to generate .gdb_index is not optimize enough for single-thread yet. Our general strategy is to focus on single-thread performance and then use multi-threads as a final shot. I believe this strategy is working well, because (a) using multi-threads may hide real problems if used too early that are obvious when run with only one thread (which would result in making the linker a so-so performance), and (b) thinking about multi-threading is distracting when optimizing code.

Do you think you can't make it any faster without using multi-threading? I instinct is that the performance we can reach without multi-threading is higher than it is now.

Sure I think I can. I am pretty sure that for example something I implemented in D32853 was a right general direction and if you agree with whole idea of this patch then I'll abandon D32853 and going to suggest a patch for DWARF parsers instead of D32853.
Another patch - D31136 had LGTM today and it opens road to speedup how relocations are handled on DWARF parsers side too. I am pretty sure that whole way is probably more promising then way of using relocated content for generation of index.

And at the same time I would really want to ask you to land this patch too, because it is usefull for comparsion purposes and does not interfere with idea to optimize single threaded solution, but raises whole
target perfomance on a new level and gives some point we can reference to.

What about landing this one with single trhreaded loop instead of parallelForEachN ? This gives ability to have multithreaded design "in mind" and still focus on a single thread ?

ruiu added a comment.May 12 2017, 12:55 PM

And at the same time I would really want to ask you to land this patch too, because it is usefull for comparsion purposes and does not interfere with idea to optimize single threaded solution, but raises whole
target perfomance on a new level and gives some point we can reference to.

What do you compare with what? If you compare a multi-threaded version against a single-threaded version, it's likely that the former is faster, but it's no surprise and not a fair comparison. That is exactly I meant by enabling multi-threading too early is distracting.

In D33122#753706, @ruiu wrote:

And at the same time I would really want to ask you to land this patch too, because it is usefull for comparsion purposes and does not interfere with idea to optimize single threaded solution, but raises whole
target perfomance on a new level and gives some point we can reference to.

What do you compare with what? If you compare a multi-threaded version against a single-threaded version, it's likely that the former is faster, but it's no surprise and not a fair comparison. That is exactly I meant by enabling multi-threading too early is distracting.

It is interesting to compare LLD with gold for example for the same configuration to see whole link time difference and also probably useful to have multithreaded version under hand to check numbers from time to time and how changes
in code affect scalability. Globally working on single thread perfomance is othogonal task I think. This patch does not help and does not interfere with work on that (probably).

My main reason why I would want to land it is that will be easier to work on other patches, since this one changes internal structures which other patches may depend on.
As I mentioned - we can land single threaded loop and continue keeping builder single threaded. It should not be distracting for optimizations, because the whole structure of index building code
is straightforward, most of job done in separated static methods or inside DWARF parsers, so I would not expect it will be distracting.

At the same time I think I understand your position good enough. I am happy that this patch seems is a right direction and that opens road for implementing possible optimizations I have in mind.
So if you have strong feel it is early for it, lets place it on hold, I am fine with that too.

ruiu added a comment.May 16 2017, 11:17 AM

I think I'm not convinced. Using multi-threading when you are optimizing it for single-thread is distracting and increases deviation when you measure its performance. Please keep it as your local patch if you need it.

If you just replace parallelForEachN with std::for_each, the resulting code would look awkward, as this code was written with thread-safety in mind. If you can make this patch more natural, it may be ok to land it, though. So if you wish to land something like this, please update this without parallelForEachN and upload a new patch.

grimar abandoned this revision.May 29 2017, 2:01 AM

In favor of D33552