This is an archive of the discontinued LLVM Phabricator instance.

Let clang driver support parallel jobs
Needs ReviewPublic

Authored by yaxunl on Oct 29 2019, 1:26 PM.

Details

Summary

When clang compiles CUDA/HIP programs, device code compilation takes most of the
compilation time, since device code usually contains complicated computation. Such
code is often highly coupled, resulting in a few large source files that become
bottlenecks for the whole project. Things get worse when such code is compiled for
multiple GPU archs, since clang compiles for each GPU arch sequentially. In practice,
it is common to compile for more than 5 GPU archs.

To alleviate this issue, this patch implements a simple scheduler that lets the clang
driver compile independent jobs in parallel.

This patch tries to minimize its impact on the existing clang driver: there are no
changes to the action builder or the tool chains. It introduces a driver option
-parallel-jobs=n to control the number of parallel jobs to launch. The default is 1,
which leaves clang driver behavior unchanged (NFC). If llvm/clang is built with
LLVM_ENABLE_THREADS off, this change is also NFC.

The basic design of the scheduler is to discover the dependencies among the jobs and
launch each job on its own thread once the jobs it depends on are done.
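
A minimal sketch of that idea, assuming the job list is already topologically ordered (as the driver's job list is) and using hypothetical Job/ExecuteJobs names rather than the actual classes in the patch:

```cpp
// Minimal sketch (not the actual patch): each job runs on its own thread
// once all of its dependencies are done, with at most MaxParallel jobs
// in flight. Assumes Jobs is topologically ordered.
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

struct Job {
  std::function<void()> Run;   // the work, e.g. spawning a sub-command
  std::vector<size_t> Deps;    // indices of jobs this one depends on
};

void ExecuteJobs(std::vector<Job> &Jobs, unsigned MaxParallel) {
  std::mutex M;
  std::condition_variable CV;
  std::vector<bool> Done(Jobs.size(), false);
  unsigned Running = 0;
  std::vector<std::thread> Threads;

  for (size_t I = 0; I != Jobs.size(); ++I) {
    std::unique_lock<std::mutex> Lock(M);
    // Launch a job only when a slot is free and its dependencies are done.
    CV.wait(Lock, [&] {
      if (Running >= MaxParallel)
        return false;
      for (size_t D : Jobs[I].Deps)
        if (!Done[D])
          return false;
      return true;
    });
    ++Running;
    Threads.emplace_back([&, I] {
      Jobs[I].Run();
      std::lock_guard<std::mutex> G(M);
      Done[I] = true;
      --Running;
      CV.notify_all();
    });
  }
  for (std::thread &T : Threads)
    T.join();
}
```

With -parallel-jobs=4, MaxParallel would be 4; the default of 1 degenerates to today's sequential behavior.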

Diff Detail

Event Timeline

yaxunl created this revision.Oct 29 2019, 1:26 PM
tra added a subscriber: echristo.

@echristo Eric, any thoughts/concerns on the overall direction for the driver?

@yaxunl One concern I have is diagnostics. When the jobs are executed in parallel, I assume all of their output will go to standard error and will be interleaved randomly across all parallel compilations. Figuring out what went wrong will be hard. Ideally we may want to collect the output from individual sub-commands and print it once each job has finished, so there's no confusion about the source of an error.

> When clang compiles CUDA/HIP programs, device code compilation takes most of the
> compilation time, since device code usually contains complicated computation. Such
> code is often highly coupled, resulting in a few large source files that become
> bottlenecks for the whole project. Things get worse when such code is compiled for
> multiple GPU archs, since clang compiles for each GPU arch sequentially. In practice,
> it is common to compile for more than 5 GPU archs.

I think this change will only help with relatively small local builds that have a few relatively large CUDA/HIP files. We did talk internally about parallelizing CUDA builds in the past and came to the conclusion that it's not very useful in practice, at least for us. We have a lot of non-CUDA files to compile, too, and that usually provides enough work for the build to hide the long CUDA compilations. Distributed builds (and, I guess, local ones too) often assume one compilation per CPU, so launching multiple parallel sub-compilations for each top-level job may not be that helpful in practice beyond manually compiling one file. That said, the change would be a nice improvement for quick rebuilds where only one or a few CUDA files need to be recompiled. In that case, however, being able to get comprehensible error messages would also be very important.

Overall I'm on the fence about this change. It may be more trouble than it's worth.

clang/include/clang/Driver/Job.h
77

Nit: Using a pointer as a key will result in sub-compilations being done in a different order from run to run, and that may result in build results changing from run to run.

I can't think of a realistic scenario yet. One case where it may make a difference is the generation of the dependency file.

We currently leak some output file name flags to device-side compilations. E.g., -fsyntax-only -MD -MF foo.d will write foo.d for each sub-compilation. At best we'll end up with the result of whichever sub-compilation finished last; at worst we'll end up with corrupt output. In this case it's the leaked output argument that's the problem, but I suspect there may be other cases where the execution order is observable.
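
For illustration only (these names are not from the patch), the difference between the two keying schemes:

```cpp
// Hypothetical illustration of the nit above. Iterating a map keyed by
// pointer visits sub-compilations in an address-dependent order that
// varies between runs; a stable sequence number keeps the launch order
// reproducible.
#include <map>

struct Command; // stand-in for the driver's Command class

// Nondeterministic iteration order: pointer values differ run to run.
std::map<const Command *, unsigned> StatusByPointer;

// Deterministic: key by the order in which the jobs were created.
std::map<unsigned /*SequenceNo*/, const Command *> JobsByIndex;
```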

clang/lib/Driver/Compilation.cpp
284–286

Indentation seems to be off. Run through clang-format?

thakis added a subscriber: thakis.Oct 30 2019, 11:15 AM
thakis added inline comments.
clang/lib/Driver/Compilation.cpp
303

Maybe a select()/fork() loop would be a better approach than spawning one thread per subprocess? This is doing thread-level parallelism in addition to process-level parallelism :)

If llvm doesn't have a subprocess pool abstraction yet, ninja's is pretty small, self-contained, battle-tested, and open-source; maybe you could copy that over (and remove the bits you don't need)? A rough sketch of the fork()/wait() idea follows the links below.

https://github.com/ninja-build/ninja/blob/master/src/subprocess.h
https://github.com/ninja-build/ninja/blob/master/src/subprocess-win32.cc
https://github.com/ninja-build/ninja/blob/master/src/subprocess-posix.cc
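
As a rough POSIX-only illustration of that suggestion (the command lines and job limit below are made up; ninja's version linked above is the portable, production-quality one):

```cpp
// Sketch of a fork()/wait() subprocess pool: fill free slots with new
// children, then block until any child exits. No extra threads needed.
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <map>
#include <vector>

int main() {
  std::vector<std::vector<const char *>> Cmds = {
      {"/bin/echo", "job1", nullptr},
      {"/bin/echo", "job2", nullptr},
  };
  const unsigned MaxJobs = 2;
  std::map<pid_t, unsigned> Live; // pid -> index of the running command
  unsigned Next = 0;

  while (Next < Cmds.size() || !Live.empty()) {
    // Fill free slots by forking new subprocesses.
    while (Next < Cmds.size() && Live.size() < MaxJobs) {
      pid_t Pid = fork();
      if (Pid < 0)
        return 1;   // fork failed
      if (Pid == 0) {
        // Child: exec the sub-command.
        execv(Cmds[Next][0], const_cast<char **>(Cmds[Next].data()));
        _exit(127);  // exec failed
      }
      Live[Pid] = Next++;
    }
    // Block until *any* child exits; no spinning, no per-job thread.
    int Status = 0;
    pid_t Pid = wait(&Status);
    if (Pid > 0) {
      std::printf("job %u exited with %d\n", Live[Pid],
                  WIFEXITED(Status) ? WEXITSTATUS(Status) : -1);
      Live.erase(Pid);
    }
  }
}
```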

+@aganea @amccarth
Users have been asking for /MP support in clang-cl for a while, and this is basically that.

Is there anything in JobScheduler that could reasonably be moved down to llvm/lib/Support? I would also like to be able to use it to implement multi-process ThinLTO instead of multi-threaded ThinLTO.

This is somewhat similar to what I was proposing in D52193.

Would you possibly provide tests and/or an example of your usage please?

clang/lib/Driver/Compilation.cpp
303

@thakis How would this new Subprocess interface work with the existing llvm/include/llvm/Support/Program.h APIs? Wouldn't it be better to simply extend what is already there with a WaitMany() and a Terminate() API, as I was suggesting in D52193? That would cover everything that's needed (a possible shape is sketched below). Or are you suggesting to further stub out ExecuteAndWait() with this new Subprocess API?
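
For concreteness, the kind of shape such an extension might take (this is a hypothetical declaration in the spirit of D52193, not an existing LLVM API):

```cpp
// Hypothetical addition next to ExecuteNoWait()/Wait() in
// llvm/Support/Program.h; not an existing API.
#include "llvm/ADT/ArrayRef.h"
#include "llvm/Support/Program.h"
#include <string>

namespace llvm {
namespace sys {
/// Block until at least one of the given processes exits, and return its
/// ProcessInfo; report failures through ErrMsg, as Wait() does.
ProcessInfo WaitMany(ArrayRef<ProcessInfo> PIs, std::string *ErrMsg);
} // namespace sys
} // namespace llvm
```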

332

In addition to what @thakis said above, yielding here is maybe not a good idea. It causes the process to spin and remain in the OS's active process list, which uselessly eats CPU cycles. This can become significant over the course of several minutes of compilation.

Here's a tiny example of what happens when threads are waiting for something to happen:
[profiler capture: the top part yields frequently; the bottom part does not yield - see D68820]

You would need to go through an OS primitive here that suspends the process until at least one job in the pool completes. On Windows this can be achieved through WaitForMultipleObjects() or I/O completion ports like those in the ninja code @thakis linked (a minimal sketch follows). You can take a look at Compilation::executeJobs() in D52193 and, further down the line, WaitMany(), which waits for at least one job/process to complete.
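
A minimal sketch of that Windows-side idea, assuming each running job exposes its process HANDLE (the surrounding bookkeeping is omitted):

```cpp
// Block the driver until at least one sub-process exits, instead of
// spinning with yield(); returns the index of the finished job.
// Note: WaitForMultipleObjects() is limited to MAXIMUM_WAIT_OBJECTS
// (64) handles per call, so a real pool needs to batch or use I/O
// completion ports instead.
#include <windows.h>
#include <vector>

DWORD WaitForAnyJob(const std::vector<HANDLE> &Handles) {
  DWORD R = WaitForMultipleObjects(static_cast<DWORD>(Handles.size()),
                                   Handles.data(), /*bWaitAll=*/FALSE,
                                   INFINITE);
  return R - WAIT_OBJECT_0;
}
```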

354

It's wasteful to start a new thread here just because ExecuteAndWait() is used inside Command::Execute(). An async mechanism, as described above, would be a lot better.

tycho added a subscriber: tycho.Nov 20 2019, 10:21 AM
yaxunl updated this revision to Diff 232405.Dec 5 2019, 11:07 AM

split the patch

Trass3r added a subscriber: Trass3r.Jan 8 2020, 5:52 AM
yaxunl marked an inline comment as done.Feb 4 2020, 1:23 PM
yaxunl added inline comments.
clang/lib/Driver/Compilation.cpp
332

Sorry for the delay.

If D52193 is committed, I will probably need only some minor changes to support parallel compilation for HIP. Therefore I hope D52193 can be committed soon.

I am wondering what the current status of D52193 is and what is blocking it. Is there any chance it could be committed soon?

Thanks.

aganea added inline comments.Feb 5 2020, 6:14 AM
clang/lib/Driver/Compilation.cpp
332

Hi @yaxunl! Nothing prevents us from finishing D52193 :-) It was meant as a prototype, but I could transform it into a more desirable state.
I left it aside because we made another (unpublished) prototype where the process invocations were collapsed into the calling process, i.e. run in a thread pool in the manner of the recent -fintegrated-cc1 flag (sketched below). But that requires cl::opt to support different contexts, as opposed to just one global state (an RFC was discussed about a year ago, but there was no consensus).
Having a thread pool instead of a process pool is faster when compiling .C/.CPP files with clang-cl /MP, but perhaps in your case that won't work: you need to call external binaries, don't you? Binaries that are not part of LLVM? If so, then landing D52193 first would make sense.
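
A rough sketch of that in-process variant, using llvm::ThreadPool as it existed around the time of this review; RunCC1OnArgs is a hypothetical stand-in for an integrated-cc1 entry point:

```cpp
// Sketch only: run each cc1-like invocation on a thread pool inside the
// driver process instead of spawning child processes. The blocker noted
// above is real: all of these invocations would share one global cl::opt
// state.
#include "llvm/Support/ThreadPool.h"
#include "llvm/Support/Threading.h"
#include <vector>

// Hypothetical integrated-cc1 entry point.
int RunCC1OnArgs(const std::vector<const char *> &Args);

void CompileAllInProcess(
    const std::vector<std::vector<const char *>> &Invocations,
    unsigned Parallelism) {
  llvm::ThreadPool Pool(llvm::hardware_concurrency(Parallelism));
  for (const auto &Args : Invocations)
    Pool.async([&Args] { RunCC1OnArgs(Args); });
  Pool.wait(); // join all sub-compilations
}
```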

yaxunl marked an inline comment as done.Feb 5 2020, 7:22 AM
yaxunl added inline comments.
clang/lib/Driver/Compilation.cpp
332

The HIP toolchain needs to launch executables other than clang during a compilation, so D52193 is more relevant to us. I believe this is also the case for CUDA, OpenMP, and probably for more general situations involving a linker. I think both thread-level and process-level parallelism are useful, but process-level parallelism is probably more generic. Therefore, landing D52193 first would help a lot.

ashi1 added a subscriber: ashi1.Aug 24 2020, 12:32 PM