This is an archive of the discontinued LLVM Phabricator instance.

RFC: [clang] Multithreaded compilation support -- NOT FOR SUBMIT
Changes Planned · Public

Authored by aganea on Sep 17 2018, 2:35 PM.

Details

Summary

This is a very preliminary change which adds support for the (clang-cl) /MP flag (Build with multiple processes). The documentation for /MP is here.

Support for -j in the clang driver could also be added along the way.

I would simply like general advice on this change. I don't plan to commit it as is. If the general idea seems good, I'll cut it into several smaller pieces.

The change introducing explicit return codes (Program.h / enum ProgramReturnCode) should probably be discussed separately, although if you have an opinion about it, please do share.
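
For reference, a minimal sketch of what such an explicit return-code enumeration could look like; the enumerator names here are illustrative, not necessarily what the patch proposes for Program.h:

// Hypothetical sketch -- enumerator names are illustrative only.
namespace llvm {
namespace sys {

enum class ProgramReturnCode {
  Success,          // Child ran and exited with code 0.
  ExecutionFailed,  // The process could not be launched at all.
  Crashed,          // The child terminated on a signal / exception.
  NonZeroExit,      // The child ran to completion but returned non-zero.
  TimedOut          // The child exceeded the caller-supplied timeout.
};

} // namespace sys
} // namespace llvm

The point of splitting these out is that callers can distinguish "we never managed to run the tool" from "the tool ran and failed", which a bare int conflates.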

Some timings

Full rebuild of LLVM + Clang + LLD (at r341847), Release, optimized tablegen. VS2017 15.8.3, Ninja 1.8.2, CMake 3.12.2

Config 1 : Intel Xeon Haswell 6 cores / 12 HW threads, 3.5 GHz, 15 MB cache, 128 GB RAM, SSD 550 MB/s

MSBuild, MSVC /MP (56 min 43 sec), 2 parallel msbuild
MSBuild, Clang + LLD (2 h 40 min), 2 parallel msbuild
MSBuild, Clang /MP + LLD (51 min 8 sec), 2 parallel msbuild
Ninja, MSVC (37 min)
Ninja, Clang [1] (31 min 52 sec)

Config 2 : Intel Xeon Skylake 18 cores / 36 HW threads, x2 (Dual CPU), 72 HW threads total, 2.3 GHz, 24.75 MB cache, 128 GB RAM, NVMe 4.6 GB/s

MSBuild, MSVC /MP (12 min 8 sec), 32 parallel msbuild
MSBuild, Clang + LLD (29 min 12 sec), 32 parallel msbuild
MSBuild, Clang /MP + LLD (9 min 22 sec), 32 parallel msbuild
Ninja, MSVC (7 min 35 sec)
Ninja, Clang [1] (11 min)

[1] Clang compiled with Clang.

Some remarks:

  • Ninja is better at discovering cmake "features" (Looking for ... - found), whereas the Visual Studio generator goes through an MSBuild instance for each "feature", which makes it very slow.
  • Ninja seems to do a better job at allocation than MSBuild, which compiles each vcproj monolithically, one by one.
  • Ninja + Clang on the Skylake is significantly worse than Ninja + MSVC, probably because of the Windows 10 thread scheduler, which is known to under-perform at high core counts, see here. The reason is that for each file compiled, clang-cl.exe invokes an additional -cc1 child process.
  • Each new clang-cl.exe child process costs between 60 and 100 ms to create (that is the time between the parent's CreateProcess() and the first file accessed by the child .exe); a rough way to measure this is sketched below. In contrast, MSBuild + Clang /MP uses only one parent .exe per vcproject.
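
For reference, a rough way to reproduce the child-process cost measurement on Windows; this is an illustrative stand-alone benchmark, not part of the patch, and it times the full spawn-to-exit window rather than the CreateProcess-to-first-file-access window measured above with FileMon:

// Rough benchmark: time CreateProcessW() through process exit for a
// trivial clang-cl invocation. Assumes clang-cl.exe is on PATH.
#include <windows.h>
#include <chrono>
#include <cstdio>

int main() {
  wchar_t Cmd[] = L"clang-cl.exe --version"; // lpCommandLine must be writable
  STARTUPINFOW SI = {sizeof(SI)};
  PROCESS_INFORMATION PI = {};
  auto Start = std::chrono::steady_clock::now();
  if (!CreateProcessW(nullptr, Cmd, nullptr, nullptr, FALSE, 0, nullptr,
                      nullptr, &SI, &PI)) {
    fprintf(stderr, "CreateProcessW failed: %lu\n", GetLastError());
    return 1;
  }
  WaitForSingleObject(PI.hProcess, INFINITE);
  auto End = std::chrono::steady_clock::now();
  auto Ms = std::chrono::duration_cast<std::chrono::milliseconds>(End - Start);
  printf("spawn + run: %lld ms\n", (long long)Ms.count());
  CloseHandle(PI.hProcess);
  CloseHandle(PI.hThread);
  return 0;
}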

Diff Detail

Event Timeline

aganea created this revision.Sep 17 2018, 2:35 PM

What about the timings of clang-cl without /MP?

hans added a comment.Sep 18 2018, 12:45 AM

What about the timings of clang-cl without /MP?

And one using Ninja rather than msbuild.

I think the real question is whether we want clang and clang-cl to do this. I'm not sure we do, as it adds complexity to solve a problem that's better solved at the build-system level.

zturner added a subscriber: aganea.Sep 18 2018, 6:35 AM

In an ideal world yes, but the reality is that many people still use MSBuild, and in that world /MP presumably helps quite a bit. And given that many people already depend on this functionality of cl, it's a potential showstopper for migrating if we don't support it. That said, if the benefit isn't significant, that's a stronger argument for not supporting it imo.

aganea edited the summary of this revision. (Show Details)Sep 18 2018, 3:30 PM

The goal of this change is frictionless compilation in VS2017 when using clang-cl as the compiler. We've realized that compiling Clang+LLD with Clang generates a faster executable than with MSVC (even the latest one).
I currently can't see a good way of generating the Visual Studio solution with CMake while using Ninja+clang-cl for compilation. They are two orthogonal configurations. Any suggestions?

aganea edited the summary of this revision. (Show Details)Sep 18 2018, 3:36 PM

I don't think this is necessary. I think your updated Matrix is pretty good.

I'm surprised to see that Ninja + Clang is slower than Ninja + MSVC. Are you using lld here?

That said, the numbers are pretty convincing

These two rows alone:
MSBuild, Clang + LLD (29 min 12 sec), 32 parallel msbuild
MSBuild, Clang /MP + LLD (9 min 22 sec), 32 parallel msbuild

Are enough to make this patch worthy of serious consideration. (I haven't looked at the content / complexity yet, was waiting on the numbers first).

The goal of this change is frictionless compilation in VS2017 when using clang-cl as the compiler. We've realized that compiling Clang+LLD with Clang generates a faster executable than with MSVC (even the latest one).
I currently can't see a good way of generating the Visual Studio solution with CMake while using Ninja+clang-cl for compilation. They are two orthogonal configurations. Any suggestions?

I don't think this is necessary. I think your updated Matrix is pretty good.

I'm surprised to see that Ninja + Clang is slower than Ninja + MSVC. Are you using lld here?

Yes, all the ‘Clang’ rows use both clang-cl and lld-link.
As stated above, I think this is caused by many more processes (clang-cl.exe) being invoked. In contrast, cl.exe does not invoke a child process. There are about 3200 files to compile, which makes 6400 clang-cl invocations, 3200 of them extra child processes; at ~60 ms of lead time each, that adds up to an extra ~3 min of wall-clock time. It is the simplest explanation I could find.

The goal of this change is frictionless compilation in VS2017 when using clang-cl as the compiler. We've realized that compiling Clang+LLD with Clang generates a faster executable than with MSVC (even the latest one).
I currently can't see a good way of generating the Visual Studio solution with CMake while using Ninja+clang-cl for compilation. They are two orthogonal configurations. Any suggestions?

I don't think this is necessary. I think your updated Matrix is pretty good.

I'm surprised to see that Ninja + Clang is slower than Ninja + MSVC. Are you using lld here?

Yes, all the ‘Clang’ rows use both clang-cl and lld-link.
As stated above, I think this is caused by many more processes (clang-cl.exe) being invoked. In contrast, cl.exe does not invoke a child process. There are about 3200 files to compile, which makes 6400 clang-cl invocations, 3200 of them extra child processes; at ~60 ms of lead time each, that adds up to an extra ~3 min of wall-clock time. It is the simplest explanation I could find.

Would you mind syncing to r342002 and trying again? I don't doubt your results, but I'm interested to see how much of an improvement this patch provides.

commit 840717872a0e0f03b19040ef143bf45aa1a7f0a0
Author: Reid Kleckner <rnk@google.com>
Date:   Tue Sep 11 22:25:13 2018 +0000

    [cmake] Speed up check-llvm 5x by delay loading shell32 and ole32

    Summary:
    Previously, check-llvm on my Windows 10 workstation took about 300s to
    run, and it would lock up my mouse. Now, it takes just under 60s.
    Previously running the tests only used about 15% of the available CPU
    time, now it uses up to 60%.

    Shell32.dll and ole32.dll have direct dependencies on user32.dll and
    gdi32.dll. These DLLs are mostly used for Windows GUI functionality,
    which most LLVM tools don't need. It seems that these DLLs acquire and
    release some system resources on startup and exit, and that requires
    acquiring the same highly contended lock discussed in this post:
    https://randomascii.wordpress.com/2017/07/09/24-core-cpu-and-i-cant-move-my-mouse/

    Most LLVM tools still have a transitive dependency on
    SHGetKnownFolderPathW, which is used to look up the user home directory
    and local AppData directory. We also use SHFileOperationW to recursively
    delete a directory, but only LLDB and clang-doc use this today. At some
    point, we might consider removing these last shell32.dll deps, but for
    now, I would like to take this free performance.

    Many thanks to Bruce Dawson for suggesting this fix. He looked at an ETW
    trace of an LLVM test run, noticed these DLLs, and suggested avoiding
    them.

    Reviewers: zturner, pcc, thakis

    Subscribers: mgorny, llvm-commits, hiraditya

    Differential Revision: https://reviews.llvm.org/D51952

Notes:
    git-svn-rev: 342002
rnk added a comment.Sep 18 2018, 5:15 PM

This is pretty cool. The process-launching APIs in LLVM were pretty basic and left a lot to be desired (returned ints, etc.). This addresses a lot of that.

clang/trunk/lib/Driver/Driver.cpp
3030

Seems nice to save a syscall and not ask how many cores we have if we were given an explicit value first.

llvm/trunk/lib/Support/Windows/Program.inc
424

I guess there's no POSIX implementation? It's kind of hard to know if we made the right abstractions without doing it for both Windows and *nix.

hans added a comment.Sep 19 2018, 12:42 AM

Thanks for adding the Ninja numbers. It confirms that Ninja is significantly faster than MSBuild + /MP.

Since that's the case, maybe we should be poking MS about making MSBuild faster instead of adding /MP support to Clang? Or making it easier to use Ninja in VS projects? Your patch says RFC after all :-)

Thanks for adding the Ninja numbers. It confirms that Ninja is significantly faster than MSBuild + /MP.

It is true for the 6-core Xeon, but not for the dual 18-core one. I think there are two issues there, somewhat unrelated to /MP: 1. Invoking the child clang-cl -cc1 for each cpp file makes things much slower. I've clocked the invocation at about 60-100 ms (caused mainly by loading DLLs and registry accesses). Most likely Reid's change about delay-loading DLLs should improve that. 2. The other issue is, I think, process context switching at the OS level. This link: https://www.phoronix.com/scan.php?page=article&item=2990wx-linux-windows&num=4 - shows that multi-threading is significantly faster on a Linux machine as far as high core counts go (when compared with the same test run on the same machine under Windows 10).

Since that's the case, maybe we should be poking MS about making MSBuild faster instead of adding /MP support to Clang? Or making it easier to use Ninja in VS projects? Your patch says RFC after all :-)

Sad for Microsoft, but at this point it is a _design_ change that MSBuild needs. And as a PM I would rather invest in bindings to better build systems (Ninja, Fastbuild). However, I expect there are still users of MSBuild out there, and without /MP, migrating to clang-cl essentially requires them to also change their build system and their VS2017 integration with said build system.

Since it helps existing msbuild configs, adding this seems like a good thing to me.

clang-cl isn't supposed to do (explicit) registry accesses when you hold it right (pass in -fms-compatibility-version etc). Have you seen registry access costs, or is that speculation?

Re fastbuild: We use ninja with https://chromium.googlesource.com/infra/goma/client/ as a distributed build system, which gives us the same build system on all platforms and -j1000 builds; we can build 40K translation units in a few minutes. We're pretty happy with our setup :-)

clang-cl isn't supposed to do (explicit) registry accesses when you hold it right (pass in -fms-compatibility-version etc). Have you seen registry access costs, or is that speculation?

No speculation; I will share the FileMon logs shortly. It is the indirectly-dependent DLLs that access the registry, not clang-cl itself.

aganea marked an inline comment as done.Sep 20 2018, 1:26 PM

It seems Reid's change has helped: cmake is no longer as significant a part of the build process as before. See the improved timings below:

The tests consist of a full cleanup (delete the build folder), cmake to regenerate, then a full rebuild of LLVM + Clang + LLD (at r342552), Release target, optimized tablegen.
VS2017 15.8.3, Ninja 1.8.2, CMake 3.12.2
For the clang-cl tests, I'm not using any official LLVM release, only the binaries I built myself. lld-link is used in that case.

(built with MSVC) means the LLVM toolchain used to perform this test was compiled in a previous run with MSVC cl 15.8.3
(built with Clang) means the LLVM toolchain used to perform this test was compiled in a previous run with Clang at r342552

I took the best figures from several runs (run in a random order).


Config 1 : Intel Xeon Haswell 6 cores / 12 HW threads, 3.5 GHz, 15 MB cache, 128 GB RAM, SSD 550 MB/s

MSBuild :

MSVC cl /MP (50 min 26 sec), 2 parallel msbuild
MSVC cl /MP (40 min 23 sec), 8 parallel msbuild
MSVC cl /MP (40 min 5 sec), 16 parallel msbuild
clang-cl (built with MSVC) (43 min 36 sec), 16 parallel msbuild
clang-cl (built with Clang) (43 min 42 sec), 16 parallel msbuild
clang-cl /MP (built with MSVC): not tested
clang-cl /MP (built with Clang) (36 min 13 sec), 8 parallel msbuild
clang-cl /MP (built with Clang) (34 min 57 sec), 16 parallel msbuild

Ninja:

MSVC cl (33 min 29 sec)
clang-cl (built with MSVC) (30 min 2 sec)
clang-cl (built with Clang) (28 min 29 sec)

Config 2 : Intel Xeon Skylake 18 cores / 36 HW threads, x2 (Dual CPU), 72 HW threads total, 2.3 GHz, 24.75 MB cache, 128 GB RAM, NVMe 4.6 GB/s

MSBuild :

MSVC cl /MP (10 min 3 sec), 32 parallel msbuild
clang-cl (built with MSVC) (24 min 15 sec), 32 parallel msbuild
clang-cl (built with Clang) (21 min 21 sec), 32 parallel msbuild
clang-cl /MP (built with MSVC) (7 min 52 sec), 32 parallel msbuild
clang-cl /MP (built with Clang) (7 min 30 sec), 32 parallel msbuild

Ninja:

MSVC cl (7 min 25 sec)
clang-cl (built with MSVC) (8 min 23 sec)
clang-cl (built with Clang) (8 min)
llvm/trunk/lib/Support/Windows/Program.inc
424

Yes, I currently don't have a Unix box at hand. But I will implement it shortly, as it needs to be committed atomically with the Windows part.

The process stuff looks cool and useful independently of /MP. Would it be possible to break that into a separate patch, and add a unit test for the behavior of the WaitAll flag?
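
A rough sketch of what such a test might look like, in the style of llvm/unittests/Support/ProgramTest.cpp; sys::WaitMany and sys::WaitPolicy below are hypothetical stand-ins for whatever the patch names its API:

// Hypothetical test sketch -- sys::WaitMany / sys::WaitPolicy are
// stand-in names, not existing LLVM APIs. Re-executes the test binary
// a few times as the child process, as ProgramTest.cpp already does.
TEST_F(ProgramEnvTest, WaitAllWaitsForEveryChild) {
  using namespace llvm;
  std::vector<sys::ProcessInfo> Children;
  for (int I = 0; I < 4; ++I) {
    StringRef Args[] = {TestMainArgv0, "--gtest_filter=ProgramTest.Sleep"};
    std::string Err;
    bool ExecFailed = false;
    Children.push_back(sys::ExecuteNoWait(TestMainArgv0, Args, /*Env=*/None,
                                          /*Redirects=*/{}, /*MemoryLimit=*/0,
                                          &Err, &ExecFailed));
    ASSERT_FALSE(ExecFailed) << Err;
  }
  // With the WaitAll policy, the call must not return until every child
  // has exited, and each ReturnCode must be filled in.
  std::vector<sys::ProcessInfo> Done =
      sys::WaitMany(Children, sys::WaitPolicy::All);
  ASSERT_EQ(Done.size(), Children.size());
  for (const sys::ProcessInfo &PI : Done)
    EXPECT_EQ(0, PI.ReturnCode);
}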

aganea added a comment.EditedSep 20 2018, 2:07 PM

@thakis > clang-cl isn't supposed to do (explicit) registry accesses when you hold it right (pass in -fms-compatibility-version etc). Have you seen registry access costs, or is that speculation?

Please see this log:

- the child clang-cl -cc1 takes ~117 ms until it gets into the global initializers. This is on my Haswell PC. On the Skylake, it takes "only" ~60 ms.
This probably explains why Ninja is slower on the Skylake when using clang-cl as the compiler. Maybe there should be a shorter codepath when only a single .cpp is being compiled, one that avoids running the child process.

The process stuff looks cool and useful independently of /MP. Would it be possible to break that into a separate patch, and add a unit test for the behavior of the WaitAll flag?

Of course, will do.

@thakis > clang-cl isn't supposed to do (explicit) registry accesses when you hold it right (pass in -fms-compatibility-version etc). Have you seen registry access costs, or is that speculation?

Please see this log:

- the child clang-cl -cc1 takes ~117 ms until it gets into the global initializers. This is on my Haswell PC. On the Skylake, it takes "only" ~60 ms.
This probably explains why Ninja is slower on the Skylake when using clang-cl as the compiler. Maybe there should be a shorter codepath when only a single .cpp is being compiled, one that avoids running the child process.

Huh, interesting! I had a local hack years ago where I measured how much not spawning a subprocess for cc1 saves (it looked like https://reviews.llvm.org/D52411), and over here it didn't do anything. Can you check if patching that in helps you a lot? If so, we should reconsider doing something like that.
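
For context, the general shape of such a hack is to detect a -cc1 invocation in the entry point and call cc1_main() in the same process instead of spawning a child; a simplified sketch, not the actual D52411 diff:

// Simplified sketch of in-process -cc1 dispatch in the clang driver
// entry point (clang/tools/driver/driver.cpp). The normal driver path
// is elided; only the dispatch decision is shown.
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"
#include <cstdint>

// Provided by clang/tools/driver/cc1_main.cpp.
int cc1_main(llvm::ArrayRef<const char *> Argv, const char *Argv0,
             void *MainAddr);

static int ExecuteCC1InProcess(llvm::ArrayRef<const char *> Argv) {
  // Argv is: <clang-path> -cc1 <frontend args...>; cc1_main() takes
  // only the frontend args, hence slice(2).
  return cc1_main(Argv.slice(2), Argv[0],
                  (void *)(intptr_t)ExecuteCC1InProcess);
}

int clang_main_sketch(int Argc, const char **Argv) {
  llvm::SmallVector<const char *, 256> Args(Argv, Argv + Argc);
  // If this invocation is already a -cc1 job, run the frontend directly
  // instead of re-spawning clang-cl.exe, saving the 60-100 ms of child
  // startup discussed above.
  if (Args.size() > 1 && llvm::StringRef(Args[1]).startswith("-cc1"))
    return ExecuteCC1InProcess(Args);
  // ... otherwise run the regular driver logic (omitted in this sketch),
  // which normally spawns the -cc1 child via sys::ExecuteAndWait().
  return 0;
}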

...and to reword this a bit: Clang taking a long time to start up in some configurations is a bug we should profile and fix :-)

aganea added a comment.EditedOct 5 2018, 2:09 PM

...and to reword this a bit: Clang taking a long time to start up in some configurations is a bug we should profile and fix :-)

This is time spent in ntdll.dll loading various low-level libraries like kernel32.dll, KernelBase.dll, combase.dll and so on, so at least 20 ms just getting to main(). Then about ~10 ms is spent loading dbghelp.dll while installing the exception handler in llvm::RegisterHandler. So a lot of wasted time. The easiest fix would be to just not invoke a new child process!


New timings without the child clang-cl.exe being invoked (hacked from D52411). The improvement is significant. Tested at r343846.

Config 1: Intel Xeon Haswell 6 cores / 12 HW threads, 3.5 GHz, 15 MB cache, 128 GB RAM, SSD 550 MB/s

Ninja:

MSVC cl (37 min 19 sec)
clang-cl (built with Clang) (in progress, will update a bit later)
clang-cl no -cc1 process (built with Clang) (in progress, will update a bit later)

Config 2: Intel Xeon Skylake 18 cores / 36 HW threads, x2 (Dual CPU), 72 HW threads total, 2.3 GHz, 24.75 MB cache, 128 GB RAM, NVMe SSD 4.6 GB/s

Ninja:

MSVC cl (7 min 33 sec)
clang-cl (built with Clang) (9 min 2 sec)
clang-cl no -cc1 process (built with Clang) (7 min 9 sec)

This raises the question of whether the improvement would be of the same order if just one clang-cl.exe were invoked for the whole compilation, a sort of local compilation-as-a-service.
In that scenario, Ninja could invoke clang-cl.exe, pass it all the files to be compiled, and let clang iterate and multi-thread internally (roughly as sketched below). That would be quite a significant change, but the gains could be substantial.
However, I suppose that would break the concept behind Goma, unless you split the invocations into precisely one clang-cl per available (remote) agent.
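
To make the idea concrete, a hypothetical sketch; all names are illustrative, and compileOneFile() stands in for an in-process, thread-safe cc1 invocation, which is exactly what does not exist yet (see the cl::opt discussion later in this thread):

// Hypothetical "compilation as a service" loop: the build system hands
// clang-cl one response file listing all sources, and clang fans the
// jobs out to an internal thread pool.
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/MemoryBuffer.h"
#include "llvm/Support/ThreadPool.h"
#include <atomic>

// Illustrative stand-in for a thread-safe, in-process cc1 invocation.
extern int compileOneFile(llvm::StringRef Source);

int compileAll(llvm::StringRef ResponseFile) {
  auto Buf = llvm::MemoryBuffer::getFile(ResponseFile);
  if (!Buf)
    return 1;
  llvm::SmallVector<llvm::StringRef, 64> Sources;
  (*Buf)->getBuffer().split(Sources, '\n', /*MaxSplit=*/-1,
                            /*KeepEmpty=*/false);

  std::atomic<int> Failures{0};
  llvm::ThreadPool Pool; // defaults to one thread per hardware thread
  for (llvm::StringRef Source : Sources)
    Pool.async([Source, &Failures] {
      if (compileOneFile(Source) != 0)
        ++Failures;
    });
  Pool.wait();
  return Failures == 0 ? 0 : 1;
}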

aganea added a comment.EditedOct 10 2018, 11:32 AM

Updated tests again. This time I've closed all applications and all unused services, and tried to have a "clean" machine.

New timings without the child clang-cl.exe being invoked (hacked from D52411). The test consists of building Clang+LLVM+LLD at r343846 using the compiler specified on each line of the table.

Config 1: Intel Xeon Haswell 6 cores / 12 HW threads, 3.5 GHz, 15 MB cache, 128 GB RAM, SSD 550 MB/s (Windows 10 Fall Creators update 1709)

Ninja:

MSVC 15.8.6 (cl + link): 29 min 4 sec
clang-cl + lld-link 7.0 official release: 25 min 45 sec
clang-cl + lld-link 8.0 r343846 built with clang-cl 7.0 official (+D52411, no -cc1 process): 23 min 27 sec

Config 2: Intel Xeon Skylake 18 cores / 36 HW threads, x2 (Dual CPU), 72 HW threads total, 2.3 GHz, 24.75 MB cache each, 128 GB RAM, NVMe SSD 4.6 GB/s (Windows 10 Fall Creators update 1709)

Ninja:

MSVC 15.8.6 (cl + link): 6 min 40 sec
clang-cl + lld-link 7.0 official release: 6 min 24 sec
clang-cl + lld-link 8.0 r343846 built with clang-cl 7.0 official (+D52411, no -cc1 process): 5 min 10 sec

I'm wondering whether the improvement is of the same order on Linux. Out of curiosity, I'd be interested to see your build times. Can any of you try it with and without D52411?

I can try to get some timings from my machine. How do we handle crash recovery in the case where we don't spawn a child process? I thought the whole reason for spawning the cc1 driver as a separate process was so that we could collect and report crash information in a nice way. Not having that seems like a heavy price to pay.

I can try to get some timings from my machine. How do we handle crash recovery in the case where we don't spawn a child process? I thought the whole reason for spawning the cc1 driver as a separate process was so that we could collect and report crash information in a nice way. Not having that seems like a heavy price to pay.

clang.exe already has a general exception handler (installed by InitLLVM::InitLLVM / sys::PrintStackTraceOnErrorSignal), which prints the call stack in case of a crash. So you wouldn't need the child process to do post-crash processing.

However, when running within a parent clang.exe, the parent calls Driver::generateCompilationDiagnostics() after a crash to create the repro (preprocessed source).

In a scenario where a single clang.exe is running (no child), we could modify Driver::generateCompilationDiagnostics() so that it gets called from within the exception handler, and invoke a child process to collect the preprocessed source.

I wrote a fully fledged crash-reporting system which does that, so I know it's possible. The tricky part is ensuring Driver::generateCompilationDiagnostics() doesn't touch potentially invalid structures (if a crash occurred) or allocate memory. The way around that is to pre-allocate and pre-copy the structures that will be accessed in case of a crash. But I don't see that as a hurdle here.

rnk added a comment.Oct 10 2018, 1:31 PM

I wrote a fully fledged crash-reporting system which does that, so I know it's possible. The tricky part is ensuring Driver::generateCompilationDiagnostics() doesn't touch potentially invalid structures (if a crash occurred) or allocate memory. The way around that is to pre-allocate and pre-copy the structures that will be accessed in case of a crash. But I don't see that as a hurdle here.

Our existing stack dumper breaks all kinds of rules in this area. It uses std::vectors to build the stack trace. As much as it pains me that it's not 110% correct, nobody has complained so far that they don't get stack traces from clang when it crashes asynchronously inside malloc.

Nice, the numbers for avoiding -cc1 are pretty compelling. It would also make debugging easier. I'd love to never see that question again on cfe-dev.

To handle preprocessed-source generation, the best idea I could come up with was to pre-allocate the "generate crash diagnostics" command line and then spawn off a child process in the signal/exception handler to do the heavy lifting. We're not concerned about generating preprocessed source when the system is low on memory or unstable, so the relative unreliability of creating a child process isn't a huge concern.
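
A sketch of the shape of that idea; all names and globals here are illustrative, and a production handler would want to bypass even ExecuteNoWait's internal allocations by calling CreateProcess directly:

// Illustrative sketch: pre-build the "generate crash diagnostics"
// command line before compilation starts, so the signal/exception
// handler only has to spawn a child and never allocates post-crash.
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Program.h"
#include <string>
#include <vector>

// Filled in before compilation begins, while the heap is still trusted.
static std::vector<std::string> CrashArgStorage;
static std::vector<llvm::StringRef> CrashArgs;

void prepareCrashCommandLine(llvm::ArrayRef<std::string> DriverArgs) {
  CrashArgStorage.assign(DriverArgs.begin(), DriverArgs.end());
  // Re-run the driver in reproducer mode (flag spelling illustrative).
  CrashArgStorage.push_back("-gen-reproducer");
  for (const std::string &S : CrashArgStorage)
    CrashArgs.push_back(S);
}

// Called from the handler: no allocation and no locks here; the fresh
// child process regenerates the preprocessed source from a clean slate.
void spawnCrashDiagnostics() {
  if (CrashArgs.empty())
    return;
  std::string Err;
  llvm::sys::ExecuteNoWait(CrashArgs[0], CrashArgs, /*Env=*/llvm::None,
                           /*Redirects=*/{}, /*MemoryLimit=*/0, &Err);
}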

dmajor added a subscriber: dmajor.Oct 12 2018, 1:42 PM
JDevlieghere resigned from this revision.Dec 17 2018, 1:56 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptFeb 8 2019, 12:43 AM
aganea retitled this revision from RFC: [clang] Multithreaded compilation support to RFC: [clang] Multithreaded compilation support -- NOT FOR SUBMIT.Mar 14 2019, 11:49 AM
aganea removed a reviewer: JDevlieghere.
aganea planned changes to this revision.Mar 14 2019, 11:49 AM

Having applied this patch to the latest clang (which took a bit of doing, and it's possible I messed up along the way), everything works in general, with roughly a 6x speed-up on my i7-7820X. However, PCH support does not seem to work correctly.

Specifically, it looks like when a file is set to build with /Yc (Create PCH), this is not executed serially with respect to the other jobs. Therefore it attempts to compile targets with /Yu (Use PCH) before the PCH has been generated, and fails.

When one or more files are passed with /Yc, PCH generation needs to execute to completion before the other jobs are started.

If I compile without using PCHs, it all works correctly (but clearly there's some performance hit).

Can anyone confirm if they've got /Yc & /Yu working correctly on their system?

Thanks,

Angus.
aganea added a comment.Apr 7 2019, 6:15 AM

That is strange. As you can't mix /Yc and /Yu on the command line, MSBuild should issue two different calls to clang-cl, one with /Yc and one with /Yu /MP. Can you check with ProcMon that the commands are issued in the right order?

Dear @aganea,
Do you have any plans to make this PR compatible with trunk? Now MSVC with /MP builds much faster than clang-cl (at least 2x faster for our project)...
Thanks!

That is strange. As you can't mix /Yc and /Yu on the command line, MSBuild should issue two different calls to clang-cl, one with /Yc and one with /Yu /MP. Can you check with ProcMon that the commands are issued in the right order?

They're issued in the right order, but the second doesn't wait for the first to complete.

Do you have any plans to make this PR compatible with trunk? Now MSVC with /MP builds much faster than clang-cl (at least 2x faster for our project)...

I'll get back to this after the vacation (somewhere in August)

They're issued in the right order, but the second doesn't wait for the first to complete.

Ok, I thought the wait was handled by MSBuild. If I understand correctly what you're saying, MSBuild issues both commands at the same time, but the second cl.exe /Yu /MP somehow waits for the first cl.exe /Yc to complete? (maybe by checking if the PCH is still open for write?)

They're issued in the right order, but the second doesn't wait for the first to complete.

Ok, I thought the wait was handled by MSBuild. If I understand correctly what you're saying, MSBuild issues both commands at the same time, but the second cl.exe /Yu /MP somehow waits for the first cl.exe /Yc to complete? (maybe by checking if the PCH is still open for write?)

That appears to be the case, yes.

tycho added a subscriber: tycho.Sep 13 2019, 4:33 PM
tycho added a comment.Sep 13 2019, 8:26 PM

I rebased this myself on the release_90 branch and I was pleasantly surprised that I got the merge right on the first try (?!), and it actually works well without any significant changes (other than merge conflict resolutions).

I've run into two problems with it though:

  • Running clang-cl.exe /MP without a number following the argument doesn't seem to automatically pick up the number of available CPUs. I haven't debugged that yet, but it's probably a trivial fix.
  • Precompiled headers aren't being built properly when using clang-cl.exe /MP under Visual Studio. If I do a clean build with e.g. /MP16, it errors immediately with this:

1>project_pch.cpp
1>CL : error : unable to read PCH file .\Release\.\/project_pch.pch: 'no such file or directory'
1>CL : fatal error : PCH file '.\Release\.\/project_pch.pch' not found: module file not found
1>Done building project "project.vcxproj" -- FAILED.

When running with -v on clang-cl.exe's command line, I can see that it starts one compile with -emit-pch and another with -emit-obj, apparently running simultaneously. It looks like the object file needs the PCH to be compiled first, so it fails when it doesn't find the .pch file the other process is still working on. I bumped msbuild.exe's verbosity up to see what it tried to do, and it only invoked clang-cl once, to compile one file (project_pch.cpp) with the "emit precompiled header" arguments on the command line. So this shouldn't be too crazy-tough to fix.

(I think my short-term/wrong workaround for this will be to ignore /MP when an input file is type c++-header. The more accurate solution would be to make the .obj depend on the .pch)


Has there been any further work on this change outside of what's recorded in this thread?

tycho added a comment.EditedSep 13 2019, 8:46 PM

OK, those two problems were actually easy enough to fix.

diff --git a/lib/Driver/Driver.cpp b/lib/Driver/Driver.cpp
index 2f98cf9e04..d449735973 100644
--- a/lib/Driver/Driver.cpp
+++ b/lib/Driver/Driver.cpp
@@ -3241,7 +3241,8 @@ void Driver::BuildActions(Compilation &C, DerivedArgList &Args,
   }
   if (FinalPhase != phases::Preprocess) {
     if (Arg *A = Args.getLastArg(options::OPT__SLASH_MP)) {
-        C.CoresToUse = llvm::hardware_concurrency();
+      C.CoresToUse = llvm::hardware_concurrency();
+      if (A->getNumValues() == 1)
         StringRef(A->getValue()).getAsInteger(10, C.CoresToUse);
     }
   }

And:

diff --git a/lib/Driver/Driver.cpp b/lib/Driver/Driver.cpp
index d449735973..55b9ca511b 100644
--- a/lib/Driver/Driver.cpp
+++ b/lib/Driver/Driver.cpp
@@ -3239,7 +3239,7 @@ void Driver::BuildActions(Compilation &C, DerivedArgList &Args,
     Args.eraseArg(options::OPT__SLASH_Yu);
     YcArg = YuArg = nullptr;
   }
-  if (FinalPhase != phases::Preprocess) {
+  if (!YcArg && FinalPhase != phases::Preprocess) {
     if (Arg *A = Args.getLastArg(options::OPT__SLASH_MP)) {
       C.CoresToUse = llvm::hardware_concurrency();
       if (A->getNumValues() == 1)

The result is a thing of beauty:

lenary added a subscriber: lenary.Sep 14 2019, 5:36 AM
Trass3r added a subscriber: Trass3r.Jan 8 2020, 5:52 AM
tycho added a comment.EditedJan 15 2020, 6:18 PM

I did some local work to make this build and pass almost all tests on Linux as well, not to make use of the multi-process features but just to avoid having a separate "for Windows only" branch, and to ensure tests pass across platforms. Unfortunately my change is not at all acceptable for inclusion because it's extremely Linux-specific, and Unix/Program.inc very explicitly says "only use generic UNIX code here". I use pidfds and a timerfd combined with epoll_wait to implement sys::WaitMany. It works well, but I'd be concerned about how to implement this as smoothly for any other *nix platform. pidfd/timerfd/epoll made it downright easy, but I started off trying to look at options that would work more universally on e.g. Linux *and* the BSDs, but I couldn't find anything that wouldn't require building a Rube-Goldberg machine juggling timers, alarms, signals, etc, etc.
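
For the curious, the core of that mechanism looks roughly like this (a condensed sketch, not the actual branch; pidfd_open needs Linux 5.3+, waitid(P_PIDFD, ...) needs 5.4+, and error handling is trimmed):

// Sketch: wait for many children with pidfds + a timerfd on one epoll.
#include <sys/epoll.h>
#include <sys/syscall.h>
#include <sys/timerfd.h>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

static int pidfdOpen(pid_t Pid) {
  // No glibc wrapper for pidfd_open until glibc 2.36.
  return (int)syscall(SYS_pidfd_open, Pid, 0);
}

// Blocks until every child in Pids exits or the deadline fires.
// Returns true if all children exited in time.
bool waitAll(const std::vector<pid_t> &Pids, int TimeoutSecs) {
  int Ep = epoll_create1(EPOLL_CLOEXEC);

  // One timerfd arms the overall deadline alongside the pidfds.
  int Tfd = timerfd_create(CLOCK_MONOTONIC, TFD_CLOEXEC);
  itimerspec Deadline = {};
  Deadline.it_value.tv_sec = TimeoutSecs;
  timerfd_settime(Tfd, 0, &Deadline, nullptr);
  epoll_event TEv = {};
  TEv.events = EPOLLIN;
  TEv.data.fd = Tfd;
  epoll_ctl(Ep, EPOLL_CTL_ADD, Tfd, &TEv);

  // A pidfd becomes readable when its process exits.
  for (pid_t Pid : Pids) {
    epoll_event PEv = {};
    PEv.events = EPOLLIN;
    PEv.data.fd = pidfdOpen(Pid);
    epoll_ctl(Ep, EPOLL_CTL_ADD, PEv.data.fd, &PEv);
  }

  size_t Remaining = Pids.size();
  while (Remaining > 0) {
    epoll_event Out;
    if (epoll_wait(Ep, &Out, 1, -1) <= 0 || Out.data.fd == Tfd)
      break; // timeout or error: stop waiting
    // Reap the exited child through its pidfd, then stop watching it.
    siginfo_t Info;
    waitid((idtype_t)P_PIDFD, (id_t)Out.data.fd, &Info, WEXITED);
    epoll_ctl(Ep, EPOLL_CTL_DEL, Out.data.fd, nullptr);
    close(Out.data.fd);
    --Remaining;
  }
  close(Tfd);
  close(Ep);
  return Remaining == 0;
}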

One thing I noticed was one test in particular: Clang :: Driver/cl-fallback.c. It seems to expose a small design oversight: since we don't ever use Command::Execute or its overrides, we never fall back to the other command in FallbackCommand::Execute.

Another thing I see as a big problem for this patch is the change in return value for sys::ExecuteAndWait. I think the rationale of using bool would be a good one in a new project: separating execution failure from child-process runtime failure makes sense. But changing it now causes problems when you merge patches that expect the old return values, and the build silently continues (if (llvm::sys::ExecuteAndWait(...)) is a common construct). I noticed this in particular when I rebased on top of the LLVM master branch, as there were several new sys::ExecuteAndWait callers (e.g. llvm/tools/llvm-reduce). The difference in return type would be less dramatic if it didn't invert the logic: previously 0 meant success, but now true means success, and since 0 != true, every if (ExecuteAndWait(...)) construct breaks terribly, as illustrated below. It would actually be better if the function were renamed or took a different set of arguments, just to make the build break explicitly so the callers could be addressed.
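
To make the hazard concrete, a hypothetical call site (this exact shape appears throughout the tree today):

#include "llvm/Support/Program.h"
#include "llvm/Support/raw_ostream.h"

// Current API: ExecuteAndWait() returns the child's exit code (int),
// so non-zero means failure and this branch is the error path:
void runTool(llvm::StringRef Prog, llvm::ArrayRef<llvm::StringRef> Args) {
  if (llvm::sys::ExecuteAndWait(Prog, Args))
    llvm::errs() << "tool failed\n";
  // If the return type silently becomes bool with true == success, the
  // line above still compiles, but now reports failure exactly when the
  // child succeeded, and stays silent when it actually fails.
}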

Anyway, I'd love to see this patch or a successor to it move forward, as we'd like to start building things with LLVM on *all* platforms where I work, and this would greatly help with the issues for the Visual Studio users (which are the majority of the prospective users at my office).

aganea added a comment.EditedJan 15 2020, 7:22 PM

Hi @tycho ! Sorry for not getting back earlier.
I implemented an alternate approach last year, which proved to be better (in terms of build times) than what I proposed in this demo patch. That is, using a thread pool instead of the process pool as implemented here. This makes compilation with /MP considerably faster, and allows for neat optimizations like file caching, re-using allocated memory pages, and possibly other things down the line. But before getting there, it requires somehow making cl::opt and the CommandLineParser thread-safe, thus revive this RFC thread. In essence, we want a mode where cl::opts have their data optionally stored in a stack context, so that we can safely call a CC1Command in each thread. I'll write a proposal to revive the RFC.

ychen added a subscriber: ychen.Jan 15 2020, 10:00 PM
lenary removed a subscriber: lenary.Jan 28 2020, 8:30 AM
lenary added a subscriber: lenary.
lenary removed a subscriber: lenary.
yaxunl added a subscriber: yaxunl.Mar 5 2020, 11:25 AM