This should be ready'ish for review now.
Diff Detail
Event Timeline
With/without patch (4 threads with the patch): Release + DebugInfo (without assert) linking clang.
real 22m27.724s user 34m21.325s sys 0m28.535s
real 28m10.310s user 27m59.016s sys 0m10.189s
Release without debug info (clang):
real 14m56.917s user 23m47.331s sys 0m8.045s
real 23m17.522s user 23m13.189s sys 0m3.422s
ELF/LTO.cpp | ||
---|---|---|
166 | Because NumThreads is 1, nothing would run in parallel? |
ELF/LTO.cpp | ||
---|---|---|
166 | Yes, I have another patch to add an option to specify the number of threads. I put this up to review to understand if the logic is sound. Cheers, Davide |
From 28m10.310s to 22m27.724s when using 4 threads is a pretty small speedup.
Any idea what is slow? Is it just that each of the 1/4 sized chunks is still slow? Is the split not even in compile time?
What is the impact in output size. Even better if you have performance numbers of the linked program.
ELF/LTO.cpp | ||
---|---|---|
44 | Drop the () around many. | |
143 | Is this related to parallel LTO? |
I agree I expected a bigger speedup. I'll investigate further (and post more precise numbers) ASAP.
ELF/LTO.cpp | ||
---|---|---|
143 | you use the returned type as argument to codeine. |
ELF/LTO.cpp | ||
---|---|---|
143 | *codegen. |
Any idea what is slow? Is it just that each of the 1/4 sized chunks is still slow? Is the split not even in compile time?
Parallel LTO codegen is highly beholden to Amdahl's law. With parallel LTO codegen there is by default a large serial optimization phase at the start, which is basically the regular LTO optimization pipeline. The only part we parallelize is the backend. Davide's numbers seem about right if he was not using an --lto-O flag to reduce the opt level (which basically turns off most of the LTO-stage optimizers, so it's most useful if you're using something like -fsanitize=cfi).
Even at lower LTO opt levels, there's a significant amount of serial time spent splitting, serializing and deserializing the partitions. With debug info enabled, the debug info needs to be duplicated between the partitions as well. The best numbers we saw were a roughly 2x speedup with >4 threads, without debug info and at opt level 1.
What we found was that parallel LTO codegen is most useful when you don't mind throwing CPU and RAM at the problem of saving a moderately sized amount of time linking your program. It doesn't completely solve the scalability problem, which is why I've been working towards getting ThinLTO supported in LLD, as the scalability story is much better there.
ELF/Driver.cpp | ||
---|---|---|
324 | Move this before Config->Optimize to sort these lines. | |
ELF/LTO.cpp | ||
42–43 | Move this line just before raw_fd_ostream OS... so that the definition gets closer to use. | |
43 | I'd spell this variable Filename as one word. | |
44 | I found that appending an empty string is a bit confusing here. I'd probably do if (Many) Filename += utostr(I); | |
143 | This function needs a comment. | |
146 | Where does this MAttrs come from? | |
166 | Why don't you just use Config->CGThreads as the number of threads? | |
185 | Remove {}. | |
ELF/Options.td | ||
241–242 | I prefer --thread-count because gold has that option. | |
ELF/SymbolTable.cpp | ||
132 | Move this code before the enclosing for-loop. (Because the entire for-loop is to replace bitcode symbols.) |
ELF/Options.td | ||
---|---|---|
241–242 | Gold's --thread-count does not affect LTO. Furthermore, parallel LTO codegen can result in the linker producing different binaries based on the number of threads, and users of --thread-count probably wouldn't expect that option to change the binary. I think this should probably be something prefixed with --lto-. |
I cced Eric who seems to think that it should be possible to avoid having multiple TargetMachines, which would be the perfect solution to this problem.
You should be able to avoid it, you just need to handle resetTargetOptions first (at least). You'll also need an optional lock on getSubtarget to deal with that. Those are the only problems I can think of off the top of my head.
-eric
Probably should check, but I don't -think- there's anything mutable there. If there is it should be fixed.
Apparently I and pcc replied at the same time.
So, we currently work around the problem with a Factory.
I opened https://llvm.org/bugs/show_bug.cgi?id=27361 so we don't lose track of this possible work (and I hope I or somebody else can get to it at some point)
So, we currently work around the problem with a Factory.
You mean we *can* work around it, right? In this change we still have
two code paths that create a TargetMachine.
What we can do to work around the problem if fixing pr27361 is too
much work for now is passing a std::function() with no arguments that
returns a std::unique_ptr<TargetMachine> to splitCodeGen.
That way we can use the same code to create a TargetMachine is both places.
Cheers,
Rafael
ELF/LTO.cpp | ||
---|---|---|
155 | Rather: [this] () { return getTargetMachine(); } |
ELF/LTO.cpp | ||
---|---|---|
144 | No, NumThreads is also used later in the function. |
Since this is for --lto-jobs, I'd name LtoJobs.