
[ThinLTO] Launch importing backends in parallel threads from gold plugin
ClosedPublic

Authored by tejohnson on Dec 9 2015, 11:07 AM.

Details

Summary

Instead of exiting after creating the combined index in the gold plugin,
unless requested otherwise via a new option, we will now launch the
ThinLTO backends (LTO and codegen pipelines with importing) in parallel
threads. The number of threads is controlled by the existing -jobs gold
plugin option, or defaults to the hardware concurrency if not specified.

As discussed on IRC with Rafael, pull split codegen into gold-plugin and
use the ThreadPool support. Refactor both the split codegen and ThinLTO
handling to utilize a new CodeGen class that encapsulates the
optimization and code generation handling for each module (split or
not). This allows better code reuse between the ThinLTO and split
codegen cases. For now I have included this along with the ThinLTO thread patch, to
show how it all fits together. I can commit the split code gen changes
first though, followed by the ThinLTO backend support. Let me know if
you would like to review these separately.

Along with follow-on fixes D16173 and D16120, all of the SPEC cpu2006 C/C++ benchmarks now build and run correctly with -flto=thin.

Diff Detail

Event Timeline

tejohnson added inline comments.Dec 17 2015, 11:16 AM
tools/gold/gold-plugin.cpp
65–66

Will do and upload the new one shortly.

66

I could and I considered that. As currently defined by gold it would be safe to memcpy. However, I thought it would be better to use a unique_ptr since it doesn't assume anything about the structure, which isn't defined here, and it seemed clearer and cleaner to avoid copying. Note that we pass a reference to this member to ThreadPool::async for use by the thread; that would have to be changed to a memcpy as well.

rafael added inline comments.Dec 17 2015, 1:35 PM
tools/gold/gold-plugin.cpp
66

OK. If we have a std::unique_ptr, we can use it instead of the Valid field, no? Valid is false iff File is null.

I got this warning:

/home/espindola/llvm/llvm/tools/gold/gold-plugin.cpp:133:17: warning: private field 'F' is not used [-Wunused-private-field]

claimed_file *F
tools/gold/gold-plugin.cpp
943

If you change codegenImpl to take an ArrayRef you don't have to do this.

I got this warning:

/home/espindola/llvm/llvm/tools/gold/gold-plugin.cpp:133:17: warning: private field 'F' is not used [-Wunused-private-field]

claimed_file *F

Will fix. Looks like there are some stale comments about using this class for the join, which isn't necessary after switching to the ThreadPool. Will clean that up.

tools/gold/gold-plugin.cpp
66

Good point, will fix this.

943

Ok, will change.

We should probably refactor splitCodeGen. It is odd that now we have
two parallel codegen paths. With ThinLTO we already have multiple BC
files, so it should probably look something like

if (SplitForParallelCodeGen)

ProduceMultipleModules();

Create the tasks.

Each task handles one bc file, which may be one of the original ones
if using thinLto or one of the split ones.

tejohnson updated this revision to Diff 43184.Dec 17 2015, 2:19 PM
  • Address review comments/suggestions

We should probably refactor splitCodeGen. It is odd that now we have
two parallel codegen paths. With ThinLTO we already have multiple BC
files, so it should probably look something like

if (SplitForParallelCodeGen)

ProduceMultipleModules();

Create the tasks.

Each task handles one bc file, which may be one of the original ones
if using thinLto or one of the split ones.

This will require some refactoring of SplitModule() as well, which currently takes a callback (that actually creates each thread) and does the module splitting. For the case where we don't want multiple split modules, like in ThinLTO, we simply pass a single output stream. Note that in both the split and non-split case the same codegen() routine is called to do the actual codegen part.

I think I've addressed all of your other comments. PTAL. Thanks!

rafael added inline comments.Dec 22 2015, 12:21 PM
tools/gold/gold-plugin.cpp
859

splitCodeGen can take an ArrayRef too. Why do you need the vec?

946

It seems odd how much work the destructor of TaskInfo is doing.

Most of the work is here because gold is not thread safe, correct? If so, it seems better to write this code explicitly after

ThinLTOThreadPool.wait();
969

Why do you need a worklist?

Can't you just use a simple loop over Modules?

1004

This can be just

Tasks.emplace_back(new TaskInfo(std::move(InputFile), std::move(OS),
                                NewFilename.c_str(), TempOutFile));

Per IRC discussion, will do some refactoring of splitCodeGen next, then subsequently rebase this patch on top of that. But I wanted to reply to the latest comments here and upload a new patch that addresses them first.

tools/gold/gold-plugin.cpp
859

Ah ok, fixed.

946

Changed TaskInfo::~TaskInfo into TaskInfo::cleanup and invoked explicitly on each task after the wait().

969

I think this was leftover from my original pre-ThreadPool implementation. Good point that it isn't needed. Updated to iterate over Modules as suggested.

1004

Fixed

tejohnson updated this revision to Diff 43489.Dec 22 2015, 3:34 PM
  • Address latest feedback.
tejohnson updated this revision to Diff 43745.Dec 29 2015, 10:17 AM

As discussed on IRC with Rafael, pull split codegen into gold-plugin and
use the ThreadPool support. Refactor both the split codegen and ThinLTO
handling to utilize a new CodeGen class that encapsulates the
optimization and code generation handling for each module (split or
not). This allows better code reuse between the ThinLTO and split
codegen cases.

For now I have included this along with the ThinLTO thread patch, to
show how it all fits together. I can commit the split code gen changes
first though, followed by the ThinLTO backend support. Let me know if
you would like to review these separately.

mehdi_amini added inline comments.Jan 2 2016, 7:28 PM
include/llvm/Support/thread.h
60 ↗(On Diff #43184)

Can be committed separately I think.

Ping.

include/llvm/Support/thread.h
60 ↗(On Diff #43745)

Will do.

tejohnson updated this revision to Diff 44824.Jan 13 2016, 6:19 PM
  • Rebase and improve -save-temps behavior with ThinLTO
tejohnson updated this object.Jan 13 2016, 8:15 PM

Ping.

Using this support extensively in my own ThinLTO spec testing. Would be great to get this reviewed and in tree. =)

Note it involves some refactoring of the split codegen path as suggested by Rafael on IRC (see the comment history for details, specifically Dec 29 update).

mehdi_amini edited edge metadata.Jan 28 2016, 11:14 AM

I'm not familiar with Gold, but here are a few minor comments

tools/gold/gold-plugin.cpp
66–67

(same =default here)

78

Any difference with PluginInputFile(PluginInputFile &&RHS) = default; ?

830

Note: you could reuse the TargetMachine for the next module processed by this thread.

961

There is a bunch of duplicated code above (used in regular LTO as well I think)

In D15390#338559, @joker.eph wrote:

I'm not familiar with Gold, but here are a few minor comments

Thanks for the comments!

tools/gold/gold-plugin.cpp
66–67

Ditto.

78

Good point, will change to default

830

That's an interesting idea. But I don't think I have any ability to control this once I send tasks to the thread pool. Is there a good way to share things across tasks assigned to the same thread by the pool?

961

True, the LTO handling in allSymbolsReadHook does some of the same things. But the LLVMContext and IRMover constructors are outside the loop over the modules since they can be shared in that case. And the invocation of getModuleForFile is a bit different. I could probably create a helper that does the getModuleForFile, setting of the target triple, and invoke IRMover::move though. I'm not sure if that ends up being clearer, but let me see what I could do here.

tejohnson updated this revision to Diff 46408.Jan 29 2016, 12:34 PM
tejohnson updated this object.
tejohnson edited edge metadata.

Address review comments. Use default move constructors, and refactor
common code into a helper.

Some more comments.

tools/gold/gold-plugin.cpp
815

It took me some time to understand what was going on; this "recursive" use of the CodeGen class can be confusing. As long as it is limited to this file I won't object.

830

Yeah, it is annoying; in my local implementation I store a "per-thread context" in a global map and protect retrieving the Context with a mutex.
(I'm not asking you to do the same here and now)

909

Could this be done in the TaskInfo dtor?

939

Usually I prefer RAII (i.e. using a new scope).

1017

This could be

std::vector<ThinLTOTaskInfo> Tasks;
Tasks.reserve(Modules.size());

(same above for std::vector<std::unique_ptr<TaskInfo>> Tasks; around line 1023)

tejohnson added inline comments.Feb 1 2016, 4:17 PM
tools/gold/gold-plugin.cpp
815

Yeah, it was unfortunately hard to get the refactoring and code sharing between the different modes without doing this. So I tried to document it as well as I could.

909

I previously had it there, but Rafael thought the dtor was too heavy-weight and wanted it more explicit. =)

939

Oh I see, the wait() is unnecessary if I provoke the ThreadPool destructor via RAII. Will do that here and for the ThinLTO thread pool as well.

1017

Ok

tejohnson updated this revision to Diff 46593.Feb 1 2016, 4:36 PM

Address more review comments: Use RAII on ThreadPool instead of explicit
wait(), and reserve TaskInfo vectors rather than emplacing unique_ptrs.

rafael added inline comments.Feb 9 2016, 4:54 PM
tools/gold/gold-plugin.cpp
876

s/thread/task/

925

It is nice that now we always use a ThreadPool.

It would be awesome if this could be refactored so that there was just one ThreadPool for thinlto and conventional parallel codegen.

tejohnson added inline comments.Feb 10 2016, 8:10 AM
tools/gold/gold-plugin.cpp
876

Fixed here and a couple other places.

925

The task type is different, and the iteration/handling to add tasks is different - I'm not sure how much code reuse we would get by sharing the thread pool creation and management code. The code to create the thread pool and insert into it is pretty minimal by itself. Also note that you are never using both thread pools in a single compilation.

tejohnson updated this revision to Diff 47458.Feb 10 2016, 8:12 AM
  • s/thread/task/ in a couple places
pcc added inline comments.Feb 12 2016, 12:24 PM
tools/gold/gold-plugin.cpp
915

I don't think thread pools are necessary for split code gen, as we can already perfectly assign the right amount of work to individual threads. Also, this implementation loses the pipelining feature from the original code (i.e. worker threads can work on codegen'ing while the main thread is still splitting). I would prefer you to use the existing implementation in llvm/CodeGen/ParallelCG.h.

mehdi_amini added inline comments.Feb 12 2016, 2:07 PM
tools/gold/gold-plugin.cpp
915

The thread that is doing the splitting can issue other jobs to the thread pool, providing the desired pipeline.

Just fuse the loop body below within the lambda...

mehdi_amini added inline comments.Feb 12 2016, 2:10 PM
tools/gold/gold-plugin.cpp
915

I'll add that while the pooling is not necessary if you only queue as many jobs as you have threads, that is not by itself a reason not to use it: the paradigm is fairly clear, and it decouples the actual splitting granularity from the number of actual worker threads, allowing experimentation with different numbers for each (providing better pipelining, for instance).

pcc added inline comments.Feb 12 2016, 2:22 PM
tools/gold/gold-plugin.cpp
915

Yes, but the current implementation doesn't need any of that. If we experimentally find that decoupling would provide some benefit, then by all means we can start using thread pools here.

In any case, if there is a compelling reason to use thread pools, the right place to make the change is in lib/CodeGen/ParallelCG.cpp rather than in a duplicate implementation here. We can defer what the design for that should look like simply by not using thread pools yet.

tejohnson updated this revision to Diff 49061.Feb 25 2016, 7:41 AM

Modify the patch to implement what is hopefully a compromise solution on
split code gen. I modified lib/CodeGen/ParallelCG.cpp to use a
ThreadPool, and go back to invoking it from the gold plugin.

This has a few nice effects:

  • ThreadPool used by all ParallelCG consumers.
  • Restores the pipelining of splitting and codegen (although note that, with a tweak, the old version of this patch could have attained this in the gold-plugin implementation as well).
  • Avoids the recursive construction of the CodeGen object on the split code gen path.

Can one of you take a look and see if this is acceptable, and if so and
there are not other comments, mark it accepted?

tejohnson updated this revision to Diff 49078.Feb 25 2016, 9:05 AM

Update a comment to match new version. Also, rename the CodeGen Filename
member to SaveTempsFilename to make it clearer and disambiguate from
places that use Filename as a local var, and initialize it as expected
for ThinLTO. Found this issue while testing changes to dependent patch
D16173.

pcc edited edge metadata.Feb 26 2016, 11:46 AM

Seems reasonable to me. Mehdi?

Great, thanks. Do either of you have any other comments or if not can one of you mark this accepted?

pcc accepted this revision.Mar 3 2016, 10:57 AM
pcc edited edge metadata.

LGTM

tools/gold/gold-plugin.cpp
864–865

I don't think this should be dependent on a property of the host machine, as there are behavioral differences between parallelism levels (e.g. symbol ordering will be different, and some uses of inline asm won't work with parallelism >1, although some of that is arguably a bug). Can you please update the comment to reflect that?

This revision is now accepted and ready to land.Mar 3 2016, 10:57 AM
In D15390#367460, @pcc wrote:

LGTM

Thanks!

tools/gold/gold-plugin.cpp
864–865

Ok, will do.