
[Clang][Driver] Re-use the calling process instead of creating a new process for the cc1 invocation
Needs Review · Public

Authored by aganea on Nov 4 2019, 1:56 PM.

Details

Summary

This patch is an optimization for speed: whenever possible, it will avoid creating a new process for the -cc1 invocation, and instead will call the cc1 tool inside the calling process (clang tool).
On Windows, this has a moderate impact on build times (please see timings in the comments below).

CFG (control flow guard) is disabled on Windows.

This patch also mildly improves build times on a Haswell 6-core Linux PC.

@Meinersbur reported no change in timings on a many-core Linux machine.


If you'd like to try this on your configuration, use the script below to ensure a standardized unit of measure:
(this is only for Windows)

Warning: the script pulls and reverts all local changes!
Please edit the script before running, to provide the right filename for applying the patch.

The script can be used like so:

# run the test (lengthy, hours!)
> powershell .\abba_test.ps1

By default, it'll do a 2-stage build, including cmake'ing with a fixed set of options, followed by AB/BA testing where it alternately rebuilds using Clang 10 with and without the patch, for at least 5 hours. If you want to change the number of hours it runs, you can do:

> powershell .\abba_test.ps1 50

Where '50' represents the number of hours it'll run. You can also skip building the two stages, if you already did it, by setting the second parameter:

> powershell .\abba_test.ps1 50 $True

You then end up with something like this:

Total iterations:  6
     |         Min       |        Mean       |       Median      |         Max       |
   A |  00:11:55.8588259 |  00:12:11.5531275 |  00:12:09.5867878 |  00:12:30.2468040 |
   B |  00:09:35.1556958 |  00:09:52.5302980 |  00:09:55.2329476 |  00:10:04.4994028 |
Diff | -00:02:20.7031301 | -00:02:19.0228296 | -00:02:14.3538402 | -00:02:25.7474012 |

If you find this useful, I can maybe convert it to a python script and send a review separately.

Diff Detail

Event Timeline

aganea updated this revision to Diff 227779.Nov 4 2019, 2:05 PM

Fix missing llvm::InitializeAllTargets() in driver.cpp

aganea marked 3 inline comments as done.Nov 4 2019, 2:15 PM

A few remarks:

clang/lib/Driver/Job.cpp
318

See my other comment about LLVM_THREAD_LOCAL

clang/tools/driver/driver.cpp
358

LLVM_THREAD_LOCAL is here for the llvm-buildozer application I presented at the LLVM conference. It's there because main() is entered from several threads, so all of clang needs to be thread-safe (I have several other patches to achieve that).
If you want this patch without LLVM_THREAD_LOCAL I'm fine with that.

llvm/lib/Support/Windows/Signals.inc
814 ↗(On Diff #227779)

When this function is entered through the global exception handler, returning EXCEPTION_EXECUTE_HANDLER lets the CRT do its own exit with the exception code.

This patch reduced the build time from 12 to 7 minutes? Sounds too good to be true. Thanks for the work! I would primarily be interested in this to avoid determining the -cc1 command line when debugging.

However, the implementation of calling its own main function gives the impression of being more of a hack. How much restructuring would be needed to properly create a CompilerInstance inside the driver instead?

clang/lib/Driver/Job.cpp
368–383

This looks fragile, as it will break when the user chooses to rename the executable (clang-cuda, --driver-mode=..., ...). Why not move the ClangDriverMain logic into the clangDriver module? Or, at least, pass it through a lambda. It would also make it easier to use clangDriver as a library (if that was ever an intended use-case?!?)

clang/tools/driver/driver.cpp
357

Could you add a comment for the purpose of this function? It's not obvious outside the context of this patch.

Meinersbur added inline comments.Nov 4 2019, 4:28 PM
clang/lib/Driver/Job.cpp
368–383

I tried this on Linux and it did not work out-of-the box (ClangDriverMain being nullptr) due to the executable's name being clang-10. After fixing, check-clang failed with 5 errors in the Driver test.

The performance benefit was within noise: 6m20s vs. 6m18s on a 2-Socket 28 cores-per-processor (112 SMT threads) Skylake-Xeon system for ninja all check-clang, assertions enabled.

Thanks for sending this out!

Instead of the dynamic lookup of that symbol, what do you think about passing in the function via a normal api? That way, the type checker and linker help us keep things working; dynamic lookup is always a bit subtle and hard to grep for. (llvm-cs wouldn't know about the current use, for example.) See https://reviews.llvm.org/D69848 for a rough sketch.

For robust crash handling, I figured on non-win we'd have a signal handler that on signal self-execs with a "actually, do use a cc1 process" and then use the existing crash machinery, but maybe that's not needed.

aganea added a comment.Nov 5 2019, 8:09 AM

Thanks for the feedback @Meinersbur!

This patch is mainly geared towards Windows users. I'm not expecting anything on Linux, you already have all the bells & whistles there :-) Although I definitely see improvements on my Linux box. Would the distro make a difference? Mine is:

$ uname -a
Linux (name edited) 5.0.0-29-generic #31~18.04.1-Ubuntu SMP Thu Sep 12 18:29:21 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

This patch reduced the build time from 12 to 7 minutes? Sounds too good to be true.

Starting up and tearing down a large process like clang on Windows is very expensive. A fair amount of CPU time is spent in the clang driver, essentially in the kernel, in the range of 30-100 ms per process:
(timings below are before this patch)

We don't have fork() on Windows, meaning that the kernel needs to start the cc1 process from scratch: allocate pages, remap the EXE, allocate the heap, allocate TLS, start the CRT, etc. Also, InitLLVM::InitLLVM is expensive because it calls into dbghelp.dll and loads symbols - just that can sometimes take up to 10 ms per process (depending on the system load).
In addition, recent Windows builds have all sorts of kernel regressions related to process creation and virtual pages allocation. Bruce Dawson has several blog entries about all this.
We're simply mitigating these issues. Upgrading the 36-core server to Windows 10 build 1903 will probably decrease the gap (12 min -> 7 min). However, I would still expect some Clang users to remain on pre-1903 builds for a few years from now.

I will fix the other issues you've mentioned.

aganea added a comment.Nov 5 2019, 8:31 AM

Thanks for sending this out!

Instead of the dynamic lookup of that symbol, what do you think about passing in the function via a normal api? That way, the type checker and linker help us keep things working; dynamic lookup is always a bit subtle and hard to grep for. (llvm-cs wouldn't know about the current use, for example.) See https://reviews.llvm.org/D69848 for a rough sketch.

Absolutely, it makes more sense; I can't remember why I used the dynamic lookup :-( I've had this patch lying around since last year.

For robust crash handling, I figured on non-win we'd have a signal handler that on signal self-execs with a "actually, do use a cc1 process" and then use the existing crash machinery, but maybe that's not needed.

Right now on Windows I'm doing about that, not in the exception handler, but a bit later on in clang/lib/Driver/Driver.cpp, L1338. The worst that can happen would be a memory corruption, which would crash Driver::generateCompilationDiagnostics() and prevent displaying the preprocessed file and the cmd-line. But you would still see the callstack (which is rendered now in the SEH, see llvm/lib/Support/CrashRecoveryContext.cpp, L194 - I still need to hook that up for the POSIX implementation). Ideally we could render some of these things in advance (or pre-allocate memory at least) to minimize the risk of a crash when displaying crash diagnostics.

aganea updated this revision to Diff 228729.EditedNov 11 2019, 10:42 AM
aganea marked 2 inline comments as done.

Addressed comments & finished the Linux part. All tests pass.

@Meinersbur : Just to show the difference between Windows & Linux, here's some timings for running the tests over the same git checkout, compiled with clang-9.
Everything was already built, I used ninja check-all and only the tests ran.
LLVM_ENABLE_PROJECTS = "clang;lld;clang-tools-extra;compiler-rt"

                     CPU                        |          OS           |        Time             | Tests ran
6-core Intel(R) Xeon(R) CPU E5-1650 0 @ 3.20GHz | Ubuntu 18.04          |  770.07s (12 min 50 sec) |   59182
6-core Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz    | Windows 10 build 1903 | 1779.83s (29 min 39 sec) |   56589

This patch does not make much difference for running the tests on Windows (it's saving only 30 sec). The friction in the OS is so high, improving just clang.exe does not change much.

hans added a comment.Nov 12 2019, 7:13 AM

Thanks for sharing this!

I still need to take a proper look, but my first thought is this:

Instead of invoking main() (or a similar function like ClangDriverMain) as an alternative to spinning up the cc1 process, would it be possible to call ExecuteCC1Tool() directly? I guess it would be a little less general, but maybe it would be more straight-forward?

Meinersbur added inline comments.Nov 12 2019, 10:27 AM
clang/lib/Driver/Job.cpp
376

Why is this check necessary? Why not assume that if Driver::Main is set, it will be the right one?

Does this change crash recovery semantics in any meaningful way? Will we still be able to get stack traces on all platforms when the compiler crashes?

llvm/lib/Support/CrashRecoveryContext.cpp
207 ↗(On Diff #228729)

You can fix this by writing:

static bool wrap_function_call(function_ref<void()> Fn,
                               bool EnableExceptionHandler,
                               unsigned &RetCode) {
  __try {
    Fn();
    return true;
  } __except (EnableExceptionHandler
                  ? LLVMUnhandledExceptionFilter(GetExceptionInformation())
                  : EXCEPTION_EXECUTE_HANDLER) {
    RetCode = GetExceptionCode();
    return false;
  }
}

bool CrashRecoveryContext::RunSafely(function_ref<void()> Fn) {
  ...
  bool Result = wrap_function_call(Fn, EnableExceptionHandler, RetCode);
  ...
}
aganea marked an inline comment as done.EditedNov 12 2019, 11:55 AM

Instead of invoking main() (or a similar function like ClangDriverMain) as an alternative to spinning up the cc1 process, would it be possible to call ExecuteCC1Tool() directly? I guess it would be a little less general, but maybe it would be more straight-forward?

Possibly, yes, I'll have to check. There might be a few edge cases where we re-enter ClangDriverMain() twice if I recall correctly - I'll check.

Does this change crash recovery semantics in any meaningful way? Will we still be able to get stack traces on all platforms when the compiler crashes?

Yes, the changes in CrashRecoveryContext allow the same side-effects as the global exception handler. You would see the callstack and minidump in the same way as before.

There's still a minimal risk of memory/heap/CRT corruption, even with CrashRecoveryContext (after a crash). But that would possibly affect LLVMUnhandledExceptionFilter as well, and right now (without this patch), that handler runs in the context of the crashed process (not the calling one) so it wouldn't be worse than what it is now.

I could write a follow-up patch to prepare on startup the cmd-line invoked by Driver::generateCompilationDiagnostics(), instead of preparing it after a crash. We could also pre-allocate a few virtual pages in advance, and use them in a BumpAllocator, instead of allocating after the crash. We could also merge the functionality of llvm-symbolizer into clang.exe, so that we could call back into clang.exe to render the callstack on stdout, even if llvm-symbolizer is not present.

The only drawback I see now is that we don't get the coredump on Linux, because the program doesn't end through the kernel signal handler. However, given that @Meinersbur says he sees no improvement on his side, we could disable this optimization on non-win? (or hide it behind a disabled flag on non-win)

clang/lib/Driver/Job.cpp
376

The driver builds phases that do not always call the cc1 process. Simply invoking clang a.cpp would run clang -cc1, then the linker. In the latter case, even if we have Driver::Main it doesn't mean we should use it. There are a number of other edge cases of the same kind, such as /fallback or building CUDA files.

aganea updated this revision to Diff 229333.Nov 14 2019, 9:36 AM
aganea marked an inline comment as done.

Call into ExecuteCC1Tool() directly, as suggested by @hans
Add more comments.
Remove pThis in CrashRecoveryContext.cpp:RunSafely() as suggested by @zturner

hans added a comment.Nov 15 2019, 1:45 AM

Thanks for the update! I think calling ExecuteCC1Tool instead of main is an improvement (there's no need to think about "re-entering" the process, it's really just a function call). I still don't see why it needs to go through a function pointer. I was hoping the code could be something like:

In Command::Execute:
if the tool we're going to execute is cc1 (not the gen-reproducer though),
then instead of forking, call ExecuteCC1Tool with the args directly (inside a context so that we can recover from a crash)

Do you think that's achievable?

clang/include/clang/Driver/Driver.h
209

Why do we need a variable for this? Couldn't the code just invoke the function directly?

aganea added a comment.EditedNov 15 2019, 6:32 AM

@hans : Simply because ExecuteCC1Tool() lives in the clang tool (clang/tools/driver/driver.cpp) and we're calling it from clangDriver.lib (clang/lib/Driver/Job.cpp). clangDriver.lib is linked into many other tools (clang-refactor, clang-rename, clang-diff, clang-check, libclang, ...). If I reference the cc1 function directly, we end up with linker errors in all those tools.
We could maybe move the tool code into clangDriver.lib, but I'm not sure it's practical, as the tool pulls in many other libraries. I thought the callback was the path of least resistance. Let me know if we could do it otherwise.

hans added a comment.Mon, Nov 18, 5:33 AM

@hans : Simply because ExecuteCC1Tool() lives in the clang tool (clang/tools/driver/driver.cpp) and we're calling it from clangDriver.lib (clang/lib/Driver/Job.cpp). clangDriver.lib is linked into many other tools (clang-refactor, clang-rename, clang-diff, clang-check, libclang, ...). If I reference the cc1 function directly, we end up with linker errors in all those tools.
We could maybe move the tool code into clangDriver.lib, but I'm not sure it's practical, as the tool pulls in many other libraries. I thought the callback was the path of least resistance. Let me know if we could do it otherwise.

Oh, I see. I didn't think about that. That's annoying, but maybe using the variable makes sense then.

clang/include/clang/Driver/Driver.h
207

It's not really a callback though. How about just "Pointer to the ExecuteCC1Tool function, if available."
It would also be good if the comment explained why the pointer is needed.
And why does it need to be thread-local? Can different threads end up with different values for this?

clang/include/clang/Driver/Job.h
136

Is there another way of avoiding in-process-cc1 after a crash? For example, could the Job know that it's being built for generating the crash report? It seems unfortunate to introduce this widely-scoped variable here.

clang/lib/Driver/Job.cpp
367–368

I don't think this form of initialization is common in LLVM code.

375

Now that we're not calling main() anymore, but cc1 directly -- is this checking still necessary?

clang/tools/driver/driver.cpp
325

It's not really re-entering. Maybe "If we call the cc1 tool directly in the driver process" is clearer?

llvm/include/llvm/Support/CrashRecoveryContext.h
105 ↗(On Diff #229333)

I don't think this form of initialization is common in LLVM. Same below.

llvm/lib/Support/CrashRecoveryContext.cpp
23 ↗(On Diff #229333)

Can we move this declaration, and the one for SignalHandler, into some header file? Declaring it manually like this doesn't seem so nice.

aganea updated this revision to Diff 230143.Tue, Nov 19, 2:11 PM
aganea marked 10 inline comments as done.
aganea edited the summary of this revision. (Show Details)

Addressed @hans' comments.

I've also added a script in the summary, if you'd like to reproduce the results on your end, and to ensure we all run the same thing.

aganea added inline comments.Tue, Nov 19, 2:14 PM
clang/lib/Driver/Job.cpp
375

Yes it is - see my other comment just above.

The driver builds phases that do not always call the cc1 process. Simply invoking clang a.cpp would run clang -cc1, then the linker, say link.exe. In this latter case (invoking link.exe), even if we have Driver::Main it doesn't mean we should use it. There are a number of other edge cases of the same kind, such as /fallback or building CUDA files, where a different executable is invoked along the way.

aganea retitled this revision from [Clang][Driver] Bypass cc1 process and re-enter driver main to [Clang][Driver] Re-use the calling process instead of creating a new process for the cc1 invocation.Tue, Nov 19, 2:18 PM
aganea edited the summary of this revision. (Show Details)
ormris added a subscriber: ormris.Tue, Nov 19, 2:27 PM
hans added inline comments.Wed, Nov 20, 6:21 AM
clang/lib/Driver/Job.cpp
367–368

I think it would be clearer with a boolean "CallCC1" or something instead of a function pointer.

375

Okay, but couldn't we find a cleaner way of figuring out that the Command is a cc1 invocation? The thing that creates the Command would know.. maybe it could set a flag or something? Or maybe instead of creating a Command it would create something else?

aganea updated this revision to Diff 230527.Thu, Nov 21, 1:47 PM
aganea marked 3 inline comments as done.
aganea added inline comments.
clang/lib/Driver/Job.cpp
375

Yeah you're right, I've refactored and added a CC1Command instead, makes things much cleaner, thank you!

I've split the CrashRecoveryContext changes into another patch to minimize the risks, please see D70568. That patch will need to go first before this one.

russell.gallop added a comment.EditedFri, Nov 22, 6:03 AM

Thanks for this change. I applied this patch (prior to splitting out https://reviews.llvm.org/D70568) and it built and worked okay (I only see one clang-cl in process explorer).

I don't see anything like same performance improvement however. I did my own assessment of build times. I built 2 x stage 1 toolchains using clang 9.0.0 with and without this patch (applied to 1b9ef3bbb595206b0097b7adec2c1b69eae6fab4). Then I used each of those to build LLVM (no clang or lld) again.

CMake settings:

# Build stage 1 (same with and without patch)
cmake -G Ninja -DCMAKE_C_COMPILER="C:/Program Files/LLVM/bin/clang-cl.exe" -DCMAKE_CXX_COMPILER="C:/Program Files/LLVM/bin/clang-cl.exe" -DCMAKE_LINKER="C:/Program Files/LLVM/bin/lld-link.exe" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_PROJECTS=clang;lld -DLLVM_TARGETS_TO_BUILD=X86 -DCLANG_ENABLE_STATIC_ANALYZER=OFF -DCLANG_ENABLE_ARCMT=OFF ../llvm

# Build stage 2 without patch
cmake -G Ninja -DCMAKE_C_COMPILER="<llvm-project.git>/build_stage1_no_patch/bin/clang-cl.exe" -DCMAKE_CXX_COMPILER="<llvm-project.git>/build_stage1_no_patch/bin/clang-cl.exe" -DCMAKE_LINKER="<llvm-project.git>/build_stage1_no_patch/bin/lld-link.exe" -DCMAKE_BUILD_TYPE=Release -DLLVM_TARGETS_TO_BUILD=X86 ../llvm

# Build stage 2 with patch
cmake -G Ninja -DCMAKE_C_COMPILER="<llvm-project.git>/build_stage1_patch/bin/clang-cl.exe" -DCMAKE_CXX_COMPILER="<llvm-project.git>/build_stage1_patch/bin/clang-cl.exe" -DCMAKE_LINKER="<llvm-project.git>/build_stage1_patch/bin/lld-link.exe" -DCMAKE_BUILD_TYPE=Release -DLLVM_TARGETS_TO_BUILD=X86 ../llvm

These both built the patched code.

I got a ~1.5% performance improvement, reducing "ninja all" from 571.5s to 562.5s on a 6-core machine (average over 3 builds on a quiet machine); times below. For reference, I am using winver 1803 and have CFG enabled.

Assuming that time_saved ≈ (process_invocation_overhead * compiler_jobs) / parallel_build_jobs
or (rearranging) process_invocation_overhead ≈ (time_saved * parallel_build_jobs) / compiler_jobs
(I know this is not perfectly true as build jobs will tail off towards the end of a build, but I think is okay for a ballpark estimate.)

I saved 9 seconds over 1720 compiler jobs (from ninja -v -n all | grep -c clang-cl), with 14 parallel build jobs, which gives a process invocation overhead of about 73ms, which is in the range you mentioned above (30-100ms). Therefore I think that this is in the right ballpark.

I can't see how this would get 20% (from the table) or 30% improvement (from the first graph). In abba_test.ps1 you include running tests (due to check-all), so it is possible that you are saving time on a very large number of clang driver invocations on very small files. This is helpful for LLVM developers, but I don't think it's representative of other builds. Alternatively, is the process overhead higher on your machine(s) for some reason (e.g. security software)?

To be clear, I'm not against this patch going in, and I can confirm that it is a performance improvement so good to have. I just can't see where the claimed 20-30% saving on build time comes from.

Time data for 3 builds (seconds)
No patch (571.349, 575.353, 567.638)
With Patch (560.870, 563.110, 563.368)

aganea edited the summary of this revision. (Show Details)Mon, Nov 25, 8:53 AM

Thanks for the feedback Russell!

Can you possibly try again with /MT? (ie. -DLLVM_USE_CRT_RELEASE=MT)

In the abba_test.ps1 script, ninja check-all is only used for preparing clang.exe with the patch. The A/B loop does not use check-all.

I've modified the abba_test.ps1 script to only build llvm/, with the same options as you used. There was also a bug in the timespan calculation; the test would last a lot longer than specified.

I also found out that on some of our multi-socket servers, Ninja defaulted to only the # threads from the first socket. If you had a 2x 18-core system, Ninja would default to 38 threads (18 x 2 + 2) instead of 74 threads (2 x 18 x 2 + 2). This behavior seems to be random, depending on the system. This has been fixed by Ninja PR 1674, I've tested all our systems with this patch and they now all default to the proper # of threads.

I'm doing a new round of tests, I'll get back with updated figures.

hans added a comment.Tue, Nov 26, 2:23 AM

Thanks! This is starting to look really good now.

clang/lib/Driver/Job.cpp
318

Maybe better to just return a bool for success/failure since there is not really a process exit code yet.

(But I'm not even sure this is worth a separate method; see comment further down.)

426

For CC1Command, PrepareExecution doesn't do very much (since it doesn't use response files). Because of that, I'm not sure breaking out PrepareExecution into a separate method is worth it. I'd just copy the part that applies to CC1Command here. Then we wouldn't need to check the return value from PrepareExecution (it can't fail if response files aren't used), and we also don't need to intercept SetResponsefile -- we'd just ignore it even if it's set.

455

llvm_unreachable("msg") is usually preferred instead of assert(0 && "msg")

clang/tools/driver/driver.cpp
330

This feels a little risky to me. Do we still need to rely on argv[1], now that we have a nice CC1Command abstraction?

Thanks for the feedback Russell!

Can you possibly try again with /MT? (ie. -DLLVM_USE_CRT_RELEASE=MT)

I tried adding this to my stage 1 builds and it didn't make any significant difference (times in seconds):

No patch (573.879, 568.519, 567.709)
With Patch (578.253, 558.515, 562.368)

In case it makes a difference, this is from a VS2017 command prompt with Visual Studio 15.9.17.

In the abba_test.ps1 script, ninja check-all is only used for preparing clang.exe with the patch. The A/B loop does not use check-all.

Ah, yes. Sorry, my mistake.

I also found out that on some of our multi-socket servers, Ninja defaulted to only the # threads from the first socket. If you had a 2x 18-core system, Ninja would default to 38 threads (18 x 2 + 2) instead of 74 threads (2 x 18 x 2 + 2). This behavior seems to be random, depending on the system. This has been fixed by Ninja PR 1674, I've tested all our systems with this patch and they now all default to the proper # of threads.

I'm not testing on a multi-socket system so don't believe I need that patch.

I'm doing a new round of tests, I'll get back with updated figures.

Thanks.

I ran the new version of abba_test.ps1 (just building LLVM) with small modifications to comment out the "Validate Ninja" step and change "check-all" to "all" to save time, as I'm concentrating on build speed. This means that I'm applying it to the same git revision as you. For clang_bypass_cc1.patch I used the patch I downloaded before (before spinning off D70568). Running for one hour on a six-core machine (winver 1803, 14 ninja jobs, CFG enabled, recently rebooted) I get:

Total iterations:  2

     |         Min       |        Mean       |       Median      |         Max       |
   A |  00:15:13.0773389 |  00:15:13.7219710 |  00:15:13.7219710 |  00:15:14.3666030 |
   B |  00:15:08.6134425 |  00:15:09.0363284 |  00:15:09.0363284 |  00:15:09.4592142 |
Diff | -00:00:04.4638964 | -00:00:04.6856426 | -00:00:04.6856426 | -00:00:04.9073888 |

So in this case it saved 0.5% of time. Using the previous maths, with 2378 clang-cl jobs, this implies process creation time of 29ms. This was fairly soon after a reboot. Maybe I'm just lucky and none of the process creation problems are affecting me on this system (at this moment).

So in this case it saved 0.5% of time. Using the previous maths, with 2378 clang-cl jobs, this implies process creation time of 29ms. This was fairly soon after a reboot. Maybe I'm just lucky and none of the process creation problems are affecting me on this system (at this moment).

It looks like the git apply didn't work, but the script continued so this was a duff experiment, please ignore. I'll try to fix this and run it again.

It looks like the git apply didn't work, but the script continued so this was a duff experiment, please ignore. I'll try to fix this and run it again.

Thanks Russell! Can you try running the script overnight? It'll give a better result and we'll see the error bar.

Maybe I'm just lucky and none of the process creation problems are affecting me on this system (at this moment).

I'm running tests on 6 different Windows systems (both desktops and servers), with different OS revisions, and they all behave differently. With this patch, some see better improvements than others. I'll post the results once I have a better picture.

It looks like the git apply didn't work, but the script continued so this was a duff experiment, please ignore. I'll try to fix this and run it again.

Note for other reviewers, the abba_test.ps1 script uses a git format patch, not the format you can download from Phabricator.

Thanks Russell! Can you try running the script overnight? It'll give a better result and we'll see the error bar.

Okay. With the patch sorted, I ran for 10 hours overnight.

Total iterations:  21

     |         Min       |        Mean       |       Median      |         Max       |
   A |  00:14:11.0340636 |  00:14:33.2494597 |  00:14:25.3644475 |  00:15:25.4082274 |
   B |  00:13:57.9445995 |  00:14:20.3073502 |  00:14:09.0937819 |  00:16:06.2885457 |
Diff | -00:00:13.0894641 | -00:00:12.9421096 | -00:00:16.2706656 |  00:00:40.8803183 |

From looking at the individual times, this was stable for a couple of hours, then there were some particularly long runs (on both A and B), possibly due to scheduled maintenance tasks, so I'd recommend using the median value to ignore these outliers. Based on the median, this saves about 1.9% for me.

aganea added a comment.EditedMon, Dec 2, 3:19 PM

You're right indeed, Russell, my bad: this gain is not as important as I initially claimed. However, there is still a gain.

Several factors came into play in my previous testings:

  1. The latest ninja 1.9.0 doesn't use all cpu sockets on some Windows systems (see this), and as it uses +2 cores in addition to the detected # of cores, this possibly induces some unwanted side-effects (such as the speed-up seen in the summary).
  2. I was using Clang 9.0 RC4, which is compiled with -DLLVM_ENABLE_ASSERTIONS=ON. When @hans sent the email for 9.0.1 I realized this flag is turned off only for the final release :-(
  3. I was using all sorts of options (/MT /GS- /arch:AVX) which actually improved build times, when compared with the baseline Clang release (see graphs below)

We indeed have a security Windows service running on our machines which we cannot disable.

I have re-run the tests on 6 different configurations (see graph for specs)
When using this patch along with -GNinja -DLLVM_ENABLE_PROJECTS=clang;lld;llvm -DLLVM_USE_CRT_RELEASE=MT -DCMAKE_CXX_FLAGS="/arch:AVX /GS-" -DCMAKE_C_FLAGS="/arch:AVX /GS-", it reduces build times by about 30 sec on many-core machines, and about 15-20 sec on 6-core ones. The 28-core machine did not see any improvement; I cannot explain why.
The variability is also reduced when this patch is applied. This seems to be consistent across all machines.

WARNING: the X coordinate (time) does not start at zero

In my view, the patch remains worthwhile overall:

  • It is easier to debug with just one process.
  • Later on, I'd like to put forward a patch which (optionally) replaces the CRT allocator with rpmalloc; without this current patch, that one would have less impact.
  • I have an alternate proposal for D52193 and D69582 which uses a thread pool instead of a process pool, and this patch is needed for that purpose (however that requires reviving & solving the thread-safe cl::opt RFC).
aganea edited the summary of this revision. (Show Details)Mon, Dec 2, 3:25 PM

Thanks for the updated timings. I have no objection to this going in.

I haven't gone through the code changes in detail so I think you should probably get approval from someone who has (such as @hans).

hans added a comment.Tue, Dec 3, 6:11 AM

I'm also supportive, even though the gains are smaller than first thought. It depends on D70568, though.

@aganea Please disable the new behavior for the Darwin platform. We rely on Clang -cc1 processes crashing to report crashes using the system's crash reporting infrastructure.

aganea added a comment.Wed, Dec 4, 6:55 AM

@aganea Please disable the new behavior for the Darwin platform. We rely on the fact that Clang -cc1 processes crash to report crashes using system's crash reporting infrastructure.

Will do of course. Just for my own curiosity, how do you actually disable the LLVM signal handler when you ship clang to customers? Through the env var LLVM_DISABLE_CRASH_REPORT? Or maybe this is disabled by default in your downstream fork? Does SignalHandler() get called?

Should I disable it through a #ifndef __APPLE__? Or something else?

@aganea Please disable the new behavior for the Darwin platform. We rely on the fact that Clang -cc1 processes crash to report crashes using system's crash reporting infrastructure.

We'll want to use this in chromium on Mac. There should probably be a cmake def to pick the default driver mode for this (and an explicit flag to pick it at runtime). We can then set that cmake def.

I guess Duncan's comments on https://reviews.llvm.org/D70568 were for this patch. I think it would be good to have:

  • A cmake option that controls the default behavior. It should still fork for Darwin by default, the detection should be done through CMake when it's building for a Darwin platform.
  • A driver/cc1 flag that allows the user to override the behavior.

This will allow users to avoid cc1 forking either by building their own Clang, or by passing in a flag with their build options. WDYT?

I guess Duncan's comments on https://reviews.llvm.org/D70568 were for this patch. I think it would be good to have:

That's right, I was confused about which review I was on.

  • A cmake option that controls the default behavior. It should still fork for Darwin by default, the detection should be done through CMake when it's building for a Darwin platform.
  • A driver/cc1 flag that allows the user to override the behavior.
  • I don't think a -cc1 flag is reasonable, because it affects the caller, not -cc1 itself. I would be confused by this I think.
  • Driver flag seems a bit too user-facing, but probably fine.
  • Another (maybe odd) option would be an environment variable, such as CLANG_FORK_CC1=1 or CLANG_FORK_CC1=0, which overrides the default.

This will allow users to avoid cc1 forking either by building their own Clang, or by passing in a flag with their build options. WDYT?

Right. Some way to control at runtime, especially so we can test both ways of launching in check-clang but also probably useful for other scenarios, and a cmake flag to set the default so vendors can choose.

aganea added a comment.EditedThu, Dec 5, 11:28 AM

I did this in driver.cpp:

// Whether the cc1 tool should be called inside the current process, or forked
// into a new process.
bool UseCC1ProcessFork = CLANG_FORK_CC1;

// ::getenv may return null, so check before constructing a StringRef.
if (const char *ForkCC1Env = ::getenv("CLANG_FORK_CC1")) {
  StringRef ForkCC1Str(ForkCC1Env);
  UseCC1ProcessFork = ForkCC1Str.compare("1") == 0 ||
                      ForkCC1Str.compare_lower("on") == 0 ||
                      ForkCC1Str.compare_lower("true") == 0;
}
if (!UseCC1ProcessFork) {
  // Here we provide a shortcut for calling the -cc1 cmd-line within the
  // same process, instead of starting a new process. This saves some
  // execution time on Windows, as process creation can be expensive on
  // that platform.
  TheDriver.CC1Main = &ExecuteCC1Tool;
}

With a new option in cmake to control the default behavior:

set(CLANG_FORK_CC1 OFF CACHE BOOL
    "Whether clang should use a new process for the -cc1 invocation")

if (APPLE)
  set(CLANG_FORK_CC1 ON)
endif()

And the corresponding flag in config.h:

/* Whether to fork a new clang.exe process for the -cc1 invocation */
#cmakedefine01 CLANG_FORK_CC1

This option needs to be controlled globally (otherwise some TUs might compile with forking "on" and others with forking "off", depending on what happens in the build scripts), so I thought the env variable would make more sense. Do you think a new cmd-line option is worth it, or is the env var enough?

The environment variable should be enough IMHO

rnk added a comment.Thu, Dec 5, 1:14 PM

I'd be fine with the cmake var and the env var and no driver flag.

I think clang developers on Darwin probably want the single-process clang -c behavior (that's how I interpreted what Duncan said), even though Apple as a distributor of clang wants the multi-process behavior. I would skip the platform check in upstream clang and suggest that Apple customize the CMake variable when producing the Xcode tools package. But that's just my opinion, please disagree if you don't like that idea.

Lastly, we could make the environment variable more cross-platform by calling it CLANG_SPAWN_CC1 to be inclusive of OSs where fork doesn't exist. Besides, doesn't clang use posix_spawn on Mac anyway?

aganea updated this revision to Diff 232861.Mon, Dec 9, 7:49 AM
aganea marked 4 inline comments as done.

Addressed all comments.
Added CLANG_SPAWN_CC1 cmake flag & environment variable to control whether we want a cc1 process or not. It is disabled by default. I left out the if (APPLE) part in cmake, as suggested by @rnk.

aganea added inline comments.Mon, Dec 9, 7:49 AM
clang/tools/driver/driver.cpp
330

Risky in what way? What would you suggest? The contents of argv are exactly the same in both cases: when called from the CC1Command, or explicitly from the command-line.

hans marked an inline comment as done.Wed, Dec 11, 6:32 AM

I'm basically happy with this, just some minor comments.

clang/lib/Driver/Job.cpp
376

I think this gets printed by "clang -###", so maybe just print something short here. Perhaps just an "(in-process)" prefix? Then we also wouldn't need to change so many test expectations.
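To illustrate the suggestion, a hedged sketch (printCommand is a hypothetical stand-in, not Job.cpp's real API) of rendering a command line with a short "(in-process)" marker:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for how `clang -###` might render a command,
// prefixing it with a short "(in-process)" marker when cc1 runs inside
// the driver process, instead of printing a longer explanatory note.
std::string printCommand(const std::vector<std::string> &Args,
                         bool InProcess) {
  std::string Out;
  if (InProcess)
    Out += " (in-process)";
  for (const std::string &A : Args)
    Out += " \"" + A + "\"";
  return Out;
}
```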

394

Do we need to disable afterwards?

clang/tools/driver/driver.cpp
267

Maybe just do the "!!" thing like for the environment variables above? It's not pretty, but at least that would be consistent, and it avoids the problem of deciding what values mean "on" and what mean "off".
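The "!!" idiom referred to here is a bare presence check: a minimal sketch (the variable name comes from this review; the function name is hypothetical):

```cpp
#include <cassert>
#include <cstdlib>

// Presence check only: setting CLANG_SPAWN_CC1 to any value enables the
// behavior. This sidesteps deciding which strings mean "on" or "off".
bool spawnCC1Set() {
  return !!::getenv("CLANG_SPAWN_CC1");
}
```

The trade-off is that a presence check cannot express "override the default back to off", which is why the review also discusses treating "0" specially.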

330

You're right, I suppose the code was already doing this.

niosHD added a subscriber: niosHD.Wed, Dec 11, 6:37 AM