Page MenuHomePhabricator

[ELF/LTO] Parallel Codegen for LLD
ClosedPublic

Authored by davide on Apr 11 2016, 6:17 PM.

Details

Summary

This should be ready'ish for review now.

Diff Detail

Event Timeline

davide updated this revision to Diff 53344.Apr 11 2016, 6:17 PM
davide retitled this revision from to parallel LTO wip.
davide updated this object.
davide added reviewers: davide, pcc, rafael.
davide edited edge metadata.EditedApr 11 2016, 6:51 PM

With/without patch (4 threads with the patch): Release + DebugInfo (without assert) linking clang.

real    22m27.724s
user    34m21.325s
sys     0m28.535s
real    28m10.310s
user    27m59.016s
sys     0m10.189s
pcc edited edge metadata.Apr 11 2016, 6:52 PM

What program is that?

In D18999#397971, @pcc wrote:

What program is that?

Updated comment, sorry hit 'submit' too quickly.

Release without debug info (clang):

real    14m56.917s
user    23m47.331s
sys     0m8.045s
real    23m17.522s
user    23m13.189s
sys     0m3.422s
davide updated this revision to Diff 53504.Apr 12 2016, 5:58 PM
davide retitled this revision from parallel LTO wip to [ELF/LTO] Parallel Codegen for LLD.
davide updated this object.
davide edited edge metadata.
davide added a subscriber: llvm-commits.
ruiu added a subscriber: ruiu.Apr 12 2016, 6:09 PM
ruiu added inline comments.
ELF/LTO.cpp
165

Because NumThreads is 1, nothing would run in parallel?

davide added inline comments.Apr 12 2016, 6:24 PM
ELF/LTO.cpp
165

Yes, I have another patch to add an option to specify the number of threads. I put this up to review to understand if the logic is sound.
I'm collecting numbers in multiple configurations (1,2,3,4 threads).
The previous comments in the review have some very early numbers using 4 threads.

Cheers,

Davide

davide updated this revision to Diff 53507.Apr 12 2016, 6:36 PM

On a second thought, added the option (in case also somebody wants to play with it).

rafael edited edge metadata.Apr 12 2016, 7:30 PM

From 28m10.310s to 22m27.724s when using 4 threads is a pretty small speedup.

Any idea what is slow? Is it just that each of the 1/4 sized chunks is still slow? Is the split not even in compile time?

What is the impact in output size. Even better if you have performance numbers of the linked program.

ELF/LTO.cpp
40

Drop the () around many.

142

Is this related to parallel LTO?

I agree I expected a bigger speedup. I'll investigate further (and post more precise numbers) ASAP.

ELF/LTO.cpp
142

you use the returned type as argument to codeine.

davide added inline comments.Apr 12 2016, 8:02 PM
ELF/LTO.cpp
142

*codegen.

pcc added a comment.Apr 12 2016, 8:13 PM

Any idea what is slow? Is it just that each of the 1/4 sized chunks is still slow? Is the split not even in compile time?

Parallel LTO codegen is highly beholden to Amdahl's law. With parallel LTO codegen there is by default a large serial optimization phase at the start, which is basically the regular LTO optimization pipeline. The only part we parallelize is the backend. Davide's numbers seem about right if he was not using an --lto-O flag to reduce the opt level (which basically turns off most of the LTO-stage optimizers, so it's most useful if you're using something like -fsanitize=cfi).

Even at lower LTO opt levels, there's a significant amount of serial time spent splitting, serializing and deserializing the partitions. With debug info enabled, the debug info needs to be duplicated between the partitions as well. The best numbers we saw were a roughly 2x speedup with >4 threads, without debug info and at opt level 1.

What we found was that parallel LTO codegen is most useful when you don't mind throwing CPU and RAM at the problem of saving a moderately sized amount of time linking your program. It doesn't completely solve the scalability problem, which is why I've been working towards getting ThinLTO supported in LLD, as the scalability story is much better there.

ruiu added inline comments.Apr 13 2016, 9:08 AM
ELF/Driver.cpp
311 ↗(On Diff #53507)

Move this before Config->Optimize to sort these lines.

ELF/LTO.cpp
38–39

Move this line just before raw_fd_ostream OS... so that the definition gets closer to use.

39–42

I'd spell this variable Filename as one word.

40

I found that appending an empty string is a bit confusing here. I'd probably do

if (Many)
  Filename += utostr(I);
142

This function needs a comment.

145

Where does this MAttrs come from?

165

Why don't you just use Config->CGThreads as the number of threads?

184

Remove {}.

ELF/Options.td
237–238 ↗(On Diff #53507)

I prefer --thread-count because gold has that option.

ELF/SymbolTable.cpp
117

Move this code before the enclosing for-loop. (Because the entire for-loop is to replace bitcode symbols.)

pcc added inline comments.Apr 13 2016, 10:14 AM
ELF/Options.td
237–238 ↗(On Diff #53507)

Gold's --thread-count does not affect LTO. Furthermore, parallel LTO codegen can result in the linker producing different binaries based on the number of threads, and users of --thread-count probably wouldn't expect that option to change the binary. I think this should probably be something prefixed with --lto-.

davide added inline comments.Apr 13 2016, 10:16 AM
ELF/Options.td
237–238 ↗(On Diff #53507)

That option has a different semantic. -jobs is what the plugin uses.
If you prefer another name, I'm all for it, but I wouldn't clash.

ELF/SymbolTable.cpp
117

Which code in particular?

ruiu added inline comments.Apr 13 2016, 1:42 PM
ELF/Options.td
237–238 ↗(On Diff #53507)

Then maybe --lto-thread-count or --lto-jobs?

ELF/SymbolTable.cpp
117

Sorry, I meant this comment.

rafael edited edge metadata.Apr 13 2016, 6:46 PM
rafael added a subscriber: echristo.

I cced Eric who seems to think that it should be possible to avoid having multiple TargetMachines, which would be the perfect solution to this problem.

You should be able to avoid it, you just need to handle resetTargetOptions first (at least). You'll also need an optional lock on getSubtarget to deal with that. Those are the only problems I can think of off the top of my head.

-eric

What about the subtargets themselves?

Probably should check, but I don't -think- there's anything mutable there. If there is it should be fixed.

Apparently I and pcc replied at the same time.

I cced Eric who seems to think that it should be possible to avoid having multiple TargetMachines, which would be the perfect solution to this problem.

I cced Eric who seems to think that it should be possible to avoid having multiple TargetMachines, which would be the perfect solution to this problem.

So, we currently work around the problem with a Factory.
I opened https://llvm.org/bugs/show_bug.cgi?id=27361 so we don't lose track of this possible work (and I hope I or somebody else can get to it at some point)

So, we currently work around the problem with a Factory.

You mean we *can* work around it, right? In this change we still have
two code paths that create a TargetMachine.

What we can do to work around the problem if fixing pr27361 is too
much work for now is passing a std::function() with no arguments that
returns a std::unique_ptr<TargetMachine> to splitCodeGen.

That way we can use the same code to create a TargetMachine is both places.

Cheers,
Rafael

davide updated this revision to Diff 53922.Apr 15 2016, 11:17 AM

Addressed everybody's comments hopefully. The patch is smaller now.

ruiu added inline comments.Apr 15 2016, 12:40 PM
ELF/Config.h
97 ↗(On Diff #53922)

Since this is for --lto-jobs, I'd name LtoJobs.

ELF/LTO.cpp
143

Instead of defining NumThreads, use Config->LtoJobs directly.

154

Does

[&]() { return getTargetMachine; }

work?

mehdi_amini added inline comments.Apr 15 2016, 1:02 PM
ELF/LTO.cpp
154

Rather: [this] () { return getTargetMachine(); }

davide updated this revision to Diff 53942.Apr 15 2016, 1:37 PM
davide marked 3 inline comments as done.
davide added inline comments.
ELF/LTO.cpp
143

No, NumThreads is also used later in the function.

ruiu added a comment.Apr 15 2016, 3:07 PM

This is looking fine, but I'll let someone who's working on LTO to sign off.

pcc added a comment.Apr 15 2016, 3:15 PM

Please add a test showing that we can link a program with 2 or more partitions.

davide updated this revision to Diff 53960.Apr 15 2016, 3:30 PM

Add a test as requested by Peter.

pcc accepted this revision.Apr 15 2016, 3:37 PM
pcc edited edge metadata.

LGTM

This revision is now accepted and ready to land.Apr 15 2016, 3:37 PM
davide closed this revision.Apr 15 2016, 3:44 PM

Committed thusly (r266484)