This is an archive of the discontinued LLVM Phabricator instance.

[ThinLTO] Launch importing backends in parallel threads from gold plugin
ClosedPublic

Authored by tejohnson on Dec 9 2015, 11:07 AM.

Details

Summary

Instead of exiting after creating the combined index in the gold plugin,
unless requested by new option, we will now launch the ThinLTO backends
(LTO and codegen pipelines with importing) in parallel threads. The
number of threads is controlled by the existing -jobs gold plugin option,
or the hardware concurrency if not specified.

As discussed on IRC with Rafael, pull split codegen into gold-plugin and
use the ThreadPool support. Refactor both the split codegen and ThinLTO
handling to utilize a new CodeGen class that encapsulates the
optimization and code generation handling for each module (split or
not). This allows better code reuse between the ThinLTO and split
codegen cases. For now I have included this along with the ThinLTO thread patch, to
show how it all fits together. I can commit the split code gen changes
first though, followed by the ThinLTO backend support. Let me know if
you would like to review these separately.

Along with follow-on fixes D16173 and D16120, all of the SPEC cpu2006 C/C++ benchmarks now build and run correctly with -flto=thin.

Diff Detail

Event Timeline

tejohnson updated this revision to Diff 42323.Dec 9 2015, 11:07 AM
tejohnson retitled this revision from to [ThinLTO] Launch importing backends in parallel threads from gold plugin.
tejohnson updated this object.
tejohnson added subscribers: llvm-commits, davidxl.
tejohnson updated this revision to Diff 42332.Dec 9 2015, 1:26 PM

Rebase to pick up changes I committed separately. The changes in this patch are now just related to the ThinLTO thread support.

mehdi_amini added inline comments.Dec 9 2015, 1:31 PM
lib/Transforms/IPO/FunctionImport.cpp
272 ↗(On Diff #42323)

Const correctness is great, please commit now separately.

tools/gold/gold-plugin.cpp
1000

We should probably refactor this out of the plugin, but this can be done later.

1039

This is very suboptimal, I don't mind if you want to get this in for now as it is limited to the gold plugin.

I plan to submit a ThreadPool to LLVMSupport (I'm using it locally in my bringup of the ld64 plugin).

Thanks for the comments.

lib/Transforms/IPO/FunctionImport.cpp
272 ↗(On Diff #42323)

Done already, just rebased this to pick it up.

tools/gold/gold-plugin.cpp
1000

Right, I am wondering how much overlap there is with the support you were adding for ld64. Would putting it in libLTO be useful?

1039

Right, part of the reason I wanted to send this right away was to see if there was something existing or under development so that I didn't have to reinvent the wheel. Glad to hear you had a similar need for this. Do you expect it to go in soon?

tejohnson updated this revision to Diff 42520.Dec 11 2015, 7:18 AM

Biggest change in this update is due to gold not being thread-safe
by default, requiring lots of refactoring to make gold callbacks in
single-threaded mode.

Also changed from std::thread to the thread wrapper in LLVM, which
handles !LLVM_ENABLE_THREADS.

Some test updates to test both single and multi-threaded handling.

I have not refactored any of the code out of gold yet. Doing so will
require refactoring out other routines, such as codegenImpl and
its callees such as runLTOPasses, or invoking via a callback.
There is some overlap between the handling in these routines and
handling that exists currently in LTOCodeGenerator, which we could
refactor out of both. I'm not sure where the best place to put the
refactored code is, maybe lib/CodeGen (which is where splitCodeGen lives)?

Biggest change in this update is due to gold not being thread-safe
by default, requiring lots of refactoring to make gold callbacks in
single-threaded mode.

Also changed from std::thread to the thread wrapper in LLVM, which
handles !LLVM_ENABLE_THREADS.

Some test updates to test both single and multi-threaded handling.

I have not refactored any of the code out of gold yet. Doing so will
require refactoring out other routines, such as codegenImpl and
its callees such as runLTOPasses, or invoking via a callback.
There is some overlap between the handling in these routines and
handling that exists currently in LTOCodeGenerator, which we could
refactor out of both. I'm not sure where the best place to put the
refactored code is, maybe lib/CodeGen (which is where splitCodeGen lives)?

It is more than just codegenImpl, but also getModuleForFile that would have to be refactored out of gold. Right now looking at the ThinLTO thread handling in gold as well as the LTO pass and codegen invocations in libLTO (LTOCodeGenerator), I don't think we gain much by refactoring this out. The actual thread handling in gold is pretty minimal, and very specific to gold. So I'd prefer to keep this in the gold plugin at least for now.

No longer WIP. Tested with both regression tests and several SPEC cpu2006 benchmarks. PTAL.

tejohnson updated this revision to Diff 42860.Dec 15 2015, 8:00 AM
  • Rebase and change to use new ThreadPool support.
rafael added inline comments.Dec 15 2015, 8:12 AM
test/tools/gold/X86/thinlto.ll
12

This gold invocation is not being tested.

32

These two only check the t4.thinlto.bc. Don't you want to, for example, run llvm-nm on t4?

tools/gold/gold-plugin.cpp
33

Why do you need Linker.h?

tejohnson added inline comments.Dec 15 2015, 8:43 AM
test/tools/gold/X86/thinlto.ll
12

Right, it was just checking to ensure that it succeeded without an error.

32

Same as above, it was just checking for the invocation succeeding without an error. I could run llvm-nm on the output file and check for "T f", just to make sure it is there and not ill-formed. Is that what you had in mind?

tools/gold/gold-plugin.cpp
33

renameModuleForThinLTO

tejohnson updated this revision to Diff 42870.Dec 15 2015, 9:29 AM
  • Add check for expected gold output file.
rafael added inline comments.Dec 15 2015, 12:55 PM
tools/gold/gold-plugin.cpp
67

thread or task now?

985

Start the function name with a lower case.

This is not a thread anymore. Task maybe?

990

It is thread safe since you create a new one for each thread, no?

1005

Don't we want something that uses gold's symbol resolution?

renameModuleForThinLTO will copy even stuff that gold has marked as preempted, no?

1036

This seems incompatible with threading or even multiple outputs, no? It looks like every thread will try to use the same output file name.

I would suggest producing an error for now.

Thanks for the comments. Responses and a couple questions below.

tools/gold/gold-plugin.cpp
67

Yeah, will change to TaskInfo and update the comments accordingly.

985

Will fix both issues

990

The comment was not entirely clear - what I meant was that I am creating a new one for each thread/task because of the fact that it is not thread-safe (i.e. they can't all share the same context). Will clarify.

1005

Unfortunately all the ThinLTO promotion logic and renaming support is in the ModuleLinker, so I couldn't just use IRMover::move with the Keep list.

Perhaps the ModuleLinker should have a mode where all it does is the promotion handling for exporting modules before calling IRMover::move. I.e. I think this would just need to do a version of processGlobalsForThinLTO where locals in a supplied ValuesToLink list (initiallized from the Keep list in gold) would be promoted if necessary. Similar to the existing processGlobalsForThinLTO but only for things already in the supplied ValuesToLink.

Does that sound right?

1036

The openOutputFile helper below will append a unique thread ID (should probably change this to TaskID...). Will add comment to that effect. Or do you still think it is better to error?

rafael edited edge metadata.Dec 15 2015, 1:40 PM
rafael added a subscriber: rafael.

Comment at: tools/gold/gold-plugin.cpp:940
@@ +939,3 @@
+ std::unique_ptr<llvm::Module> RenamedModule =
+ renameModuleForThinLTO(M, &CombinedIndex);

+ if (!RenamedModule)

rafael wrote:

Don't we want something that uses gold's symbol resolution?

renameModuleForThinLTO will copy even stuff that gold has marked as preempted, no?

Unfortunately all the ThinLTO promotion logic and renaming support is in the ModuleLinker, so I couldn't just use IRMover::move with the Keep list.

Perhaps the ModuleLinker should have a mode where all it does is the promotion handling for exporting modules before calling IRMover::move. I.e. I think this would just need to do a version of processGlobalsForThinLTO where locals in a supplied ValuesToLink list (initiallized from the Keep list in gold) would be promoted if necessary. Similar to the existing processGlobalsForThinLTO but only for things already in the supplied ValuesToLink.

Does that sound right?

Not sure. Thinking a bit more about it I think I am missing the big picture.

I was at least under the impression that we could:

  • Run the IRMover more or less like we do in normal LTO, but instead

of moving to a merged module, each task gets one file and moves it
into an empty module.

  • Run a transformation that updates name and visibility in place.
  • For each module we want to cherry pick something, FunctionImport brings it in.

Comment at: tools/gold/gold-plugin.cpp:971
@@ +970,3 @@
+ if (!options::obj_path.empty())
+ Filename = options::obj_path;

+ else if (options::TheOutputType == options::OT_SAVE_TEMPS)

rafael wrote:

This seems incompatible with threading or even multiple outputs, no? It looks like every thread will try to use the same output file name.

I would suggest producing an error for now.

The openOutputFile helper below will append a unique thread ID (should probably change this to TaskID...). Will add comment to that effect. Or do you still think it is better to error?

No, but please add a test :-)

Cheers,
Rafael

tejohnson updated this revision to Diff 43060.Dec 16 2015, 1:51 PM
tejohnson edited edge metadata.
  • Rebase and address Rafael's review comments.

    I think I have addressed all your comments. Notable changes from prior version:
  • Use gold's symbol resolution via IRMover, and invoke renameModuleForThinLTO afterwards to do renaming (with TODO noting that this is temporary until we can do this in place)
  • Rebase to use new RAII wrapper for plugin file handling. Add move assignment/copy constructor to enable moving ownership into TaskInfo object. Change RAII PluginInputFile wrapper to use a unique_ptr for the ld_plugin_input_file object so that it can be moved, and add a flag to prevent double-release on a moved object.
  • s/Thread/Task/
  • Add test to ensure gold's symbol resolution not overridden.
  • Add test for plugin option obj-path handling with ThinLTO threads
rafael added inline comments.Dec 17 2015, 9:07 AM
tools/gold/gold-plugin.cpp
101–102

Please rebase the patch.

102

You can just memcpy the ld_plugin_input_file, no ?

tejohnson added inline comments.Dec 17 2015, 11:16 AM
tools/gold/gold-plugin.cpp
101–102

Will do and upload the new one shortly.

102

I could and I considered that. As currently defined by gold it would save to memcpy. However, I thought it would be better to use a unique_ptr since it doesn't assume anything about the structure which isn't defined here, and it seemed clearer and cleaner to avoid copying. Note we pass a reference to this member to the ThreadPool::async to be used by the thread, that would have to be changed to a memcpy as well.

rafael added inline comments.Dec 17 2015, 1:35 PM
tools/gold/gold-plugin.cpp
102

OK. If we have a std::unique_ptr, we can use it instead of the Valid field, no? Valid is false iff File is null.

I got this warning:

/home/espindola/llvm/llvm/tools/gold/gold-plugin.cpp:133:17: warning: private field 'F' is not used [-Wunused-private-field]

claimed_file *F
tools/gold/gold-plugin.cpp
1003

If you change codegenImpl to take an ArrayRef you don't have to do this.

I got this warning:

/home/espindola/llvm/llvm/tools/gold/gold-plugin.cpp:133:17: warning: private field 'F' is not used [-Wunused-private-field]

claimed_file *F

Will fix. Looks like there are some stale comments about using this class for the join, which isn't necessary after switching to the ThreadPool. Will clean that up.

tools/gold/gold-plugin.cpp
102

Good point, will fix this.

1003

Ok, will change.

We should probably refactor splitCodeGen. It is odd that now we have
two parallel codegen paths. With thinLTO we already multiple BC files,
so it should probably look something like

if (SplitForParallelCodeGen)

ProduceMultipleModules();

Create the tasks.

Each task handles one bc file, which may be one of the original ones
if using thinLto or one of the split ones.

tejohnson updated this revision to Diff 43184.Dec 17 2015, 2:19 PM
  • Address review comments/suggestions

We should probably refactor splitCodeGen. It is odd that now we have
two parallel codegen paths. With thinLTO we already multiple BC files,
so it should probably look something like

if (SplitForParallelCodeGen)

ProduceMultipleModules();

Create the tasks.

Each task handles one bc file, which may be one of the original ones
if using thinLto or one of the split ones.

This will require some refactoring of SplitModule() as well, which currently takes a callback (that actually creates each thread) and does the module splitting. For the case where we don't want multiple split modules, like in ThinLTO, we simply pass a single output stream. Note that in both the split and non-split case the same codegen() routine is called to do the actual codegen part.

I think I've addressed all of your other comments. PTAL. Thanks!

rafael added inline comments.Dec 22 2015, 12:21 PM
tools/gold/gold-plugin.cpp
921

splitCodeGen can take a ArrayRef too. Why do you need the vec?

1006

It seems odd how much work the destructor of TaskInfo is doing.

Most of the work is here because gold is not thread safe, correct? It so, it seems better to write this code explicitly after

ThinLTOThreadPool.wait();
1029

Why do you need a worklist?

Can't you just just a simple loop over Modules?

1064

This can be just

Tasks.emplace_back(new TaskInfo(std::move(InputFile), std::move(OS),
                                NewFilename.c_str(), TempOutFile));

Per IRC discussion, will do some refactoring of splitCodeGen next, then subsequently rebase this patch on top of that. But I wanted to reply to the latest comments here and upload a new patch that addresses them first.

tools/gold/gold-plugin.cpp
921

Ah ok, fixed.

1006

Changed TaskInfo::~TaskInfo into TaskInfo::cleanup and invoked explicitly on each task after the wait().

1029

I think this was leftover from my original pre-ThreadPool implementation. Good point that it isn't needed. Updated to iterate over Modules as suggested.

1064

Fixed

tejohnson updated this revision to Diff 43489.Dec 22 2015, 3:34 PM
  • Address latest feedback.
tejohnson updated this revision to Diff 43745.Dec 29 2015, 10:17 AM

As discussed on IRC with Rafael, pull split codegen into gold-plugin and
use the ThreadPool support. Refactor both the split codegen and ThinLTO
handling to utilize a new CodeGen class that encapsulates the
optimization and code generation handling for each module (split or
not). This allows better code reuse between the ThinLTO and split
codegen cases.

For now I have included this along with the ThinLTO thread patch, to
show how it all fits together. I can commit the split code gen changes
first though, followed by the ThinLTO backend support. Let me know if
you would like to review these separately.

mehdi_amini added inline comments.Jan 2 2016, 7:28 PM
include/llvm/Support/thread.h
58

Can be committed separately I think.

Ping.

include/llvm/Support/thread.h
58

Will do.

tejohnson updated this revision to Diff 44824.Jan 13 2016, 6:19 PM
  • Rebase and improve -save-temps behavior with ThinLTO
tejohnson updated this object.Jan 13 2016, 8:15 PM

Ping.

Using this support extensively in my own ThinLTO spec testing. Would be great to get this reviewed and in tree. =)

Note it involves some refactoring of the split codegen path as suggested by Rafael on IRC (see the comment history for details, specifically Dec 29 update).

mehdi_amini edited edge metadata.Jan 28 2016, 11:14 AM

I'm not familiar with Gold, but here are a few minor comments

tools/gold/gold-plugin.cpp
101–102

(same =default here)

113

Any difference with PluginInputFile(PluginInputFile &&RHS) = default; ?

861

Note: you could reuse the TargetMachine for the next module processed by this thread.

1021

There is a bunch of duplicated code above (used in regular LTO as well I think)

In D15390#338559, @joker.eph wrote:

I'm not familiar with Gold, but here are a few minor comments

Thanks for the comments!

tools/gold/gold-plugin.cpp
101–102

Ditto.

113

Good point, will change to default

861

That's an interesting idea. But I don't think I have any ability to control this once I send tasks to the thread pool. Is there a good way to share things across tasks assigned to the same thread by the pool?

1021

True, the LTO handling in allSymbolsReadHook does some of the same things. But the LLVMContext and IRMover constructors are outside the loop over the modules since they can be shared in that case. And the invocation of getModuleForFile is a bit different. I could probably create a helper that does the getModuleForFile, setting of the target triple, and invoke IRMover::move though. I'm not sure if that ends up being clearer, but let me see what I could do here.

tejohnson updated this revision to Diff 46408.Jan 29 2016, 12:34 PM
tejohnson updated this object.
tejohnson edited edge metadata.

Address review comments. Use default move constructors, and refactor
common code into a helper.

Some more comment.

tools/gold/gold-plugin.cpp
846

It took me some time to understand what was going on, this "recursive" use of the CodeGen class can be confusing. As long as it is limited to this file I won't object.

861

Yeah is it annoying, in my local implementation I store a "per thread context" in a global map protect retrieving the Context with a mutex.
(I'm not asking you to do the same here and now)

969

Could this be done in the TaskInfo dtor?

999

Usually I prefer RAII (i.e. using a new scope).

1077

This could be

std::vector<ThinLTOTaskInfo> Tasks;
Tasks.reserve(Modules.size());

(same above for std::vector<std::unique_ptr<TaskInfo>> Tasks; around line 1023)

tejohnson added inline comments.Feb 1 2016, 4:17 PM
tools/gold/gold-plugin.cpp
846

Yeah, it was unfortunatly hard to get the refactoring and code sharing between the different modes without doing this. So I tried to document it as well as I could.

969

I previously had it there, but Rafael thought the dtor was too heavy-weight and wanted it more explicit. =)

999

Oh I see, the wait() is unnecessary if I provoke the ThreadPool destructor via RAII. Will do that here and for the ThinLTO thread pool as well.

1077

Ok

tejohnson updated this revision to Diff 46593.Feb 1 2016, 4:36 PM

Address more review comments: Use RAII on ThreadPool instead of expicit
wait(), and reserve TaskInfo vectors rather than emplacing unique_ptrs.

rafael added inline comments.Feb 9 2016, 4:54 PM
tools/gold/gold-plugin.cpp
938

s/thread/task/

985

It is nice that now we always use a ThreadPool.

It would be awesome if this could be refactored so that there was just one ThreadPool for thinlto and conventional parallel codegen.

tejohnson added inline comments.Feb 10 2016, 8:10 AM
tools/gold/gold-plugin.cpp
938

Fixed here and a couple other places.

985

The task type is different, and the iteration/handling to add tasks is different - I'm not sure how much code reuse we would get by sharing the thread pool creation and management code. The code to create the thread pool and insert into it is pretty minimal by itself. Also note that you never are using both thread pools in a single compilation.

tejohnson updated this revision to Diff 47458.Feb 10 2016, 8:12 AM
  • s/thread/task/ in a couple places
pcc added inline comments.Feb 12 2016, 12:24 PM
tools/gold/gold-plugin.cpp
975

I don't think thread pools are necessary for split code gen, as we can already perfectly assign the right amount of work to individual threads. Also, this implementation loses the pipelining feature from the original code (i.e. worker threads can work on codegen'ing while the main thread is still splitting). I would prefer you to use the existing implementation in llvm/CodeGen/ParallelCG.h.

mehdi_amini added inline comments.Feb 12 2016, 2:07 PM
tools/gold/gold-plugin.cpp
975

The thread that is doing the splitting can issue other jobs to the thread pool, providing the desired pipeline.

Just fuse the loop body below within the lambda...

mehdi_amini added inline comments.Feb 12 2016, 2:10 PM
tools/gold/gold-plugin.cpp
975

I'll add that while the pooling is not necessary if you only queue as many jobs as you have threads, it is not a reason by itself not to use it: the paradigm is fairly clear, and it decouples the actual splitting granularity from the number of actual worker threads, allowing to experiment with different numbers for each (providing better pipelining for instance).

pcc added inline comments.Feb 12 2016, 2:22 PM
tools/gold/gold-plugin.cpp
975

Yes, but the current implementation doesn't need any of that. If we experimentally find that decoupling would provide some benefit, then by all means we can start using thread pools here.

In any case, if there is a compelling reason to use thread pools, the right place to make the change is in lib/CodeGen/ParallelCG.cpp rather than in a duplicate implementation here. We can defer what the design for that should look like simply by not using thread pools yet.

tejohnson updated this revision to Diff 49061.Feb 25 2016, 7:41 AM

Modify the patch to implement what is hopefully a compromise solution on
split code gen. I modified lib/CodeGen/ParallelCG.cpp to use a
ThreadPool, and go back to invoking it from the gold plugin.

This has a few nice effects:

  • ThreadPool used by all ParallelCG consumers.
  • Restores the pipelining of splitting and codegen (although note that with a tweak to the old version of this thread that this could be attained in the gold-plugin implementation as well).
  • Avoids the recursive construction of the CodeGen object on the split code gen path.

Can one of you take a look and see if this is acceptable, and if so and
there are not other comments, mark it accepted?

tejohnson updated this revision to Diff 49078.Feb 25 2016, 9:05 AM

Update a comment to match new version. Also, rename the CodeGen Filename
member to SaveTempsFilename to make it clearer and disambiguate from
places that use Filename as a local var, and initialize it as expected
for ThinLTO. Found this issue while testing changes to dependent patch
D16173.

pcc edited edge metadata.Feb 26 2016, 11:46 AM

Seems reasonable to me. Mehdi?

Great, thanks. Do either of you have any other comments or if not can one of you mark this accepted?

pcc accepted this revision.Mar 3 2016, 10:57 AM
pcc edited edge metadata.

LGTM

tools/gold/gold-plugin.cpp
935–941

I don't think this should be dependent on a property of the host machine, as there are behavioral differences between parallelism levels (e.g. symbol ordering will be different, and some uses of inline asm won't work with parallelism >1, although some of that is arguably a bug). Can you please update the comment to reflect that?

This revision is now accepted and ready to land.Mar 3 2016, 10:57 AM
In D15390#367460, @pcc wrote:

LGTM

Thanks!

tools/gold/gold-plugin.cpp
935–941

Ok, will do.